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r .89. Lek first compares all proteins in- the proteome to s * 

•■ one another. Next, the resulting' BLAST reports are 
; ... \ parsed, and a graph is created wherein each protein 
. constitutes a node; any hit between two proteins 
with an expectation beneath a user-specified 
threshold constitutes an edge. Lek then uses this 
graph to compute a similarity between each protein 
pair ij in the context of the graph as a whole by 
simply dividing the number of BLAST hits shared In 
common between the two proteins by the total 
number of proteins hit by / and /. This simple metric 
has several interesting properties. First because the 
similarity metric takes Into account both the simi- • 
larity and the differences between the two sequenc- 
es at the level of BLAST hits, the metric respects the 
multidomain nature of protein space. Two multido- 
mafn proteins, for instance, each containing do- 
mains A and B, will have a greater pairwise similarity 
to each other than either one will have to a protein 
containing only A or B domains, so long as A-B- 
containing multidomain proteins are less frequent in 
the proteome than are single-domain proteins con- 
taining A or B domains. A second Interesting prop- 
erty of this similarity metric is that it can be used to 
produce a similarity matrix for the proteome as a 
whole without having to first produce a multiple 
alignment for each protein family, an error-prone 
and very time-consuming process. Finally, the met- 
ric does not require that either sequence have sig- 
nificant homology to the other in order to have a 
defined similarity to each other, only that they 
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nome would open up new strategies for hu- 
fcman biological research and would have a 
"major impact on medicine, and through med- 
■■' ;; icine and public health, on society. Effects on 
; ; biomedical research are already being felt. 

This assembly of the human genome se- • 
: quence is but a; first, hesitant step on a long 
... ;■. : . and ; ,exciting journey . toward understahding 
the role of the genome in human biology It 
has been possible only because of innova- 
. tions in instrumentation and software that 
. .have allowed automation of almost every step 
. of the process from DNA preparation to an- 
notation. The next steps are clear: We must 
. . !; define the complexity, that ensues when this 
-v relati vely modest set of about 30,000 genes is 
, expressed. The sequence provides the frame- 
work upon which all the genetics, biochem- ; 
• ; istry, physiology, and ultimately phenotype ' 
depend. It provides the boundaries for scien- 
tific inquiry. The sequence is only the first 
level of. understanding of the genome. All 
genes and their control elements must be 
identified; their functions, in concert as well 
as in isolation, defined; their sequence varia- 
tion worldwide described; and the relation 
„ .. between genome variation and specific phe- 
notypic characteristics determined. Now we 
know what we have to explain. 

Another paramount challenge awaits: 
^» ublic discussion of this information and its 
^^potential for improvement of personal health. 
.Many diverse sources of data have shown ' 
that any two individuals are more than 99.9% 
identical in sequence, which means that all 
the glorious differences among individuals in 
our species that can be attributed to genes 
falls in a mere 0.1% of the sequence. There 
are two fallacies to be avoided: determinism, 
the idea that all characteristics of the person 
: are "hard-wired" by the genome; and reduc- 
tionism, the view that with complete knowl- 
edge of the human genome sequence, it is 
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of em- 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence. 
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... , :< ... ; r .:cleave the ^vector. I but generally /tfeaved /several 
. . times within the 50-kbp insert A 1264^bp' Bani H\ 
rkanamycin resistance .^cassette • . (purified .;from 
. :.../.■ pUCK4; Amersham Pharmacia, catalog rib. 27-4958- 
01) was added and ligation.was carried out at 37°C 
.in the continual presence of Bgl II. As Bgl ll-Bgl II 
ligations occurred, they were continually cleaved, 
-whereas Bam Hl-Bgl II. ligations were not cleaved. A 
].r. high yield of internally, deleted circular .library mo\- ' 
. x • ecules .was obtained in which ?trie residual insert 
ends.-.were .separated by the ckanamycin. cassette 
J . : v'j" PNA. The : internally deleted libraries, when plated 
..... .}., ,on agar containing ampicillin (50 u.g/ml), carbeni- 

, . V y .5 cillin (50 ng/ml), and kanamycin (15 ng/ml), pro- : 

: /Educed relatively uniform large colonies. The result- 
; >; ? in g clones could be prepared jor sequencing using 
Vv Vy\ the same procedures as clones from the 10-kbp 

libraries. • 
. 34. Transformed cells were plated on agar diffusion 
.>-, plates prepared with a fresh top layer containing no 
. : ...... , antibiotic poured on top of a previously set bottom 

■ f layer , containing excess antibiotic, to achieve the 
correct final concentration.. This method of plating 
permitted the cells to develop antibiotic resistance 
before being exposed to antibiotic without the po- 
tential clone bias that can be introduced through 
liquid outgrowth protocols. - After colonies had 
grown. QBot (Genetix, UK) automated colony-pick- 
ing robots were used to pick colonies meeting strin- : 
- gent size and shape criteria and to. Inoculate 384- 
well microtiter plates containing liquid growth me- 
dium. . Liquid, cultures were .incubated overnight, 
with shaking, and were. scored for growth before 
. passing to template preparation. Template DNA was 
t extracted from liquid bacterial culture using a pro- 
.. . cedure based upon the alkaline lysis miniprep meth- 
od (773) adapted for high throughput processing in 
384-well microtiter plates. Bacterial cells were 
lysed; cell debris was removed by centrifugation; 
. , .y< and plasmld DNA was recovered by isopropanol 
: , . precipitation t and .res uspended Jn .10 mM tris-HCl 
buffer. Reagent dispensing operations were accom- 
plished using Titertek MAP 8 liquid dispensing sys- 
tems. Plate-to-plate liquid transfers were performed 
using Tomtec Quadra 384 Model 320 pipetting ro- 
bots. All plates were tracked throughout processing 
by unique plate barcodes. Mated sequencing reads . 
from opposite ends of each done insert were ob- 
tained by preparing two 384-well cycle sequencing 
reaction plates from each plate of plasmld template 
DNA using ABI-PRISM BigDye Terminator chemistry 
(Applied Biosystems) and standard M13 forward 
and reverse primers. Sequencing reactions were pre- 
pared using the Tomtec Quadra 384-320 pipetting 
robot Parent-child plate relationships and, by ex- 
tension, forward-reverse sequence mate pairs were 
established by automated plate barcode reading by 
the onboard barcode reader and were recorded by 
direct LIMS communication. .Sequencing reaction 
products were purified by alcohol precipitation and 
were dried, sealed, and stored at 4°C in the dark 
until needed for sequencing, at which time the 
reaction products were resuspended In delonlzed 
formamide and sealed Immediately to prevent deg- 
radation. All sequence data were generated using a 
single sequencing platform, the ABI PRISM 3700 
DNA Analyzer. Sample sheets were created at load 
time using a Java-based application that facilitates 
barcode scanning of the sequencing plate barcode, 
retrieves sample information from the central LIMS. 
and reserves unique trace identifiers. The applica- 
tion permitted a single sample sheet file In the 
linking directory and deleted previously created 
sample sheet files Immediately upon scanning of a 
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types present in subjects of different ethnogeo- 
graphic origins, providing insights into popula- 
tion history and migration patterns. Although 
. , such studies have suggested that modem human 
. . lineages derive .from Africa, many important 
: questions regarding human origins remain un- 
( s -answered, . and more analyses -using detailed 
SNP maps will be needed to settle these con- 
\ troversies. In addition to providing evidence for 
. population .expansions, , migration, ; and adrnix-i 
ture, SNPs. can serve as markers for the extent 
of evolutionary constraint acting on particular 
genes. The correlation between patterns of in-. 
:' traspecies and ; interspecies : genetic ..variation 
may prove to_ be. especially informative to iden- . 
tify sites, of reduced genetic diversity that may 
mark loci where sequence variations are not 
tolerated . 

The remarkable heterogeneity in SNP 
density . implies, .that there are a ; variety of 
. forces acting on polymorphism— sparse re- j 
gions may, have lower SNP density because 
the mutation rate is lower, because most of 
those regions have a lower fraction of.muta-,. 
tions that are tolerated, or because recent 
: strong selection in favor, of a newly .arisen ; 
allele "swept" the linked variation out of the 
population {165). The effect of random ge- 
netic drift also varies widely across the ge- 
nome. The nonrecombining portion of the Y 
chromosome faces the strongest pressure 
from random drift because there are roughly 
, one-quarter as many Y chromosomes in the,' 
population as. there are autosomal chromo-" 
somes, and the level of polymorphism on the 
Y is correspondingly less. Similarly, the X ; 
chromosome has a smaller effective popu- ? 
lation size than the autosomes, and. its. nu- 
cleotide diversity is also reduced. But even 
across a single autosome, the effective pop- 
ulation size can vary because the density of 
deleterious mutations may vary. Regions of - 
high density of deleterious mutations will 
see a greater rate of elimination by . selec- 
tion, and the effective population size will 
be smaller (166). As a result, the density of 
even completely neutral SNPs will be lower 
in such regions. There is a large literature 
on the association between SNP density 
and local recombination rates in Drosoph- 
ila, and it remains an important task to 
assess the strength of this association in the 
human genome, because of its impact on 
the design of local SNP densities for dis- 
ease-association studies. It also remains an 
important task to validate SNPs on a 
genomic scale in order to assess the degree 
of heterogeneity among geographic and 
ethnic populations. ' 



8.4 Genome complexity 

We will soon be in a position to move away 
from the cataloging of individual compo- 
nents of the system, and beyond the sim- 
plistic notions of "this binds to that, which 



then docks on this, and then the. complex 
moves there. . . (167) to the exciting area 
, , of ; network .perturbations, . nonlinear re- 
sponses and, -thresholds, :.and .their .pivotal 
role in human diseases. 
; ; ^e ;eniimeration : of other. "parts lists" ' re- 
v: v.yeals that .in .organisms .with complex nervous 
systems,^ neither gene number, neuron number, 
'< i nor number of , cell .Jtypes ;corfelatesVin .any 
meaningful manner with even simplistic mea 
sures of structural or behavioral complexity 
Nor would they be. expected to; this is the realm 
i^of nonlmearities .andepigenesis (168). The 520 
million neurons f of the common octopus exceed 
i; the neuronal number in the brain of a mouse by 
■an order of magnitude. It is apparent from a 
. .comparison, of genomic data on the mouse and 
■ ; human, and from.qomparadve mammalian neu- 
roanatomy (169), that the morphological and 
f. : behavioral diversity found V mammals is un- 
derpinned by a similar gerie'repiertoire and sim-; 
^ neuroanatomies.: For;> example, .when one : 
compares a pygmy marmoset (which is only 4 
>; : inches; tall . and weighs .about 6 . ounces) to a : 
J ..chimpanzee, the brain volume of. this minute 
. ; primate is found, to be only about 1 .5 cm 3 , two 
v, orders of magnitude less than that of a chimp 
. and three . orders less than that of humans. Yet 
; the neuroanatomies of all three brains are strik- 
. ingly . similar, and.the behavioral characteristics 
of the pygmy marmoset are little different from 
• those of chimpanzees. Between humans and 
chimpanzees, the gene number, gene structures 
and functions, chromosomal and genomic or-. 
. , : ganizations, and cell types and neuroanatomies \ 
■ . .are almost . indistinguishable, yet the develop- 
mental ; modifications, that predisposed human 
. lineages to cortical expansion and development .-, 
of the larynx, giving rise to language, culrninat- . 
ed in a massive ..singularity that by even the . 
simplest of . criteria made humans more com- 
~plex in a behavioral sense.— - — .r;::^ 
. ;..>; Simple examination of the numberof neu- ; 
rons, cell .types, or genes or of the genome * 
size does not alone account for the differenc- 
es in complexity that we observe. Rather, it is 
the interactions within and among these sets 
that result in such great variation. In addition, 
it is possible that there are "special cases" of 
regulatory gene networks that have a dispro- 
portionate effect on the overall system. We 
have presented several examples of "regula- 
tory genes" that are significantly increased in 
the human genome compared with the fly and 
worm. These include extracellular ligands 
and their cognate receptors (e.g., wnt, friz- 
zled, TGF-0, ephrins, and connexins), as well 
as. nuclear regulators (e.g., the KRAB and 
homeodomain transcription factor families), 
where a few. proteins control broad develop- 
mental processes. The answers to these 
"complexities" perhaps lie in these expanded 
gene families and differences in the regulato- 
ry control of ancient genes, proteins, path- 
ways, and cells. 



8.5 Beyond single components 

While few would disagree with the intuitive 
■ conclusion .'that -'Emstem's .brain was more 
- complex than that of Prosophild, closer com- 
parisons such as whether the set of predicted^ 

ie™ 



/human- proteins "Js^mbreVcomplex than the 
v-protein-set^of t jp/^0pA//(a;-and if so, to what 

• i - degree,, are, not straightforward, 1 since protein, 
protein, domain,- or.pro tern-protein interaction , 
measures do not capture context-dependent 

^interactions that underpin the/ dynamics un- 
v deriving phenotype. 

v Currently, there are more" than 30 different 
mathematical descriptions, of complexity (170) 

• . However, we have yet to understand the math- ; 
, ematical dependency, relating • the number of 
,, genes with organism complexity. One pragmat- 
v vie, approach .to the analysis.of biological sys- 

terns, which are composed of nonidentical ele- 
: f oments (proteins, 1 protein complexes, mteracting 
V ■ v cell \ types,: - and .interacting ? neuronal popula- 
^ l.tions),.is to^ ele- \- 

A: ments ofthe system can be represented by the 
^vertices of complex topographies, with the edg- 
Vv ;es -representing the interactions between them. 
/ Examination of large networksreveals that they 
. can self-organize, but more important, they can 
. \ be particularly robust. This robustness is not 
due to redundancy, but is a property , of inho- 
mogeneously. wired networks. The error toler- 
ance of such networks comes with a price; they 
are vulnerable to the selection or removal of a 
few nodes that contribute disproportionately to . 
network stability. .Gene knockouts provide . an . 

• illustration. ;Some knockouts may have minor 
-."effects, whereas others have catastrophic effects 
■v '■- on the system. :In the case of vimentin, a sup- 

• posedly critical component of the cytoplasmic 
, intermediate filament network of mammals, the 
■*■ knockout of the gene in mice reveals them to be 

reproductively normal, with no obvious pheno- 

• - typic effect? (1 72); and yet ;m*e"usually conspic- 
: uous; vjrnentin :;networki is - completely absent 

On the other hand, ; -30% : of knockouts in 
Drosophila and mice correspond to critical 
nodes whose reduction in gene product, or total 
elimination, causes the network to crash most 
of the time, although even in some of these 
cases, phenotypic normalcy ensues, given the 
appropriate genetic background. Thus, there are 
no "good" genes or "bad" genes; but only net- 
works that exist at various levels and at differ- 
ent connectivities, and at different states of 
sensitivity to perturbation. Sophisticated math- 
ematical analysis needs to be constantly evalu- 
ated against hard biological data sets that spe- 
cifically address network dynamics. Nowhere is 
this more critical than in attempts to come to 
grips with "complexity " particularly because 
deconvoluting and correcting complex net- 
works that have undergone perturbation, and 
have resulted in human diseases, is the greatest 
significant challenge now facing us. 

It has been predicted for the last 15 years 
that complete sequencing of the human ge- 
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. quence.well into centromeric regions and al- 
lowed high-quality resolution of complex re- 
vpeat regions. Likewise, in Drosophila, the 
) BAC physical map was most useful in re- 
< , gions near the highly repetitive centromeres 
and telomeres. WGA.has been found to de- 
liver excellent-quality reconstructions of the 
i unique regions of the genome. As the genome ; 

• • size, and more importantly the repetitive con-; 
.tent, increases, the WGA approach delivers, 
less of the repetitive sequence. 

The cost and overall efficiency of clone-by- 
clone approaches makes them difficult to justify 
: as a stand-alone strategy for future large-scale 
genome-sequencing projects. Specific applica- 
tions of BAC-based or other clone mapping and 
■ sequencing strategies to resolve ambiguities in, 
r sequence assembly that .cannot be efficiently % 
resolved with computational approaches alone ; 
are clearly worth exploring. Hybrid approaches 

• to whole-genome sequencing will only work if . 
there is sufficient coverage in both the whole- 

. genome, shotgun phase and the BAC clone se- , 
quencing phase.. Our experience with human 
genome assembly suggests that this will require 
at least 3X coverage of both whole-genome and 
BAC shotgun sequence data. 

8.2 The low gene number in humans 

We have sequenced and assembled -95% of 
the euchrpmatic sequence , of H. sapiens and 
used a new automated gene prediction meth- 
od to produce a preliminary catalog of the ; 
human genes. This has provided a major sur- • 
prise: We have found far fewer genes (26,000 
to 38,000) than the earlier molecular pre- 
dictions (50,000 to over 140,000). Whatever , 
the reasons for this current disparity, only 
detailed annotation, comparative genomics 
(particularly using the Mus musculus ge- .. 
nome), and careful molecular dissection of 
complex "phenotypes will clarify this critical 
issue of the basic "parts list" of our genome. 
Certainly, the analysis is still incomplete and 
considerable refinement will occur in the 
years to come as the precise structure of each 
transcription unit is evaluated. A good place 
to start is to determine why the gene esti- 
mates derived from EST data are so discor- 
dant with our predictions. It is likely that the 
following contribute to an inflated gene num- 
ber derived from ESTs: the variable lengths 
of 3'- and 5 '-untranslated leaders and trailers; 
the little-understood vagaries of RNA pro- 
cessing that often leave intronic regions in an 
unspliced condition; the finding that nearly 
40% of human genes are alternatively spliced 
(753); and finally, the unsolved technical 
problems in EST library construction where 
contamination from heterogeneous nuclear 
RNA and genomic DNA are not uncommon. 
Of course, it is possible that there are genes 
that remain unpredicted owing to the absence 
of EST or protein data to support them, al- 
though our use of mouse genome data for 



/^predicting genes should limit.this number. As 
-was .true at the beginning of genome sequenc- 
i r-^ing, ultimately.it will be necessary, to measure 
■i :. >mRNA in . specific cell types to demonstrate 
; . the presence of a gene. . . 
„ A J. B^S. Haldane speculated in 1937ithat a 
^population of organisms might have to pay a 
,v price for ; the.number of genes it can possibly 
% carry He theorized that when the number of 
> genes becomes too.large, each zygote carries 
■\l so many new deleterious mutations that the 
. population simply cannot maintain itself. On 
. the basis of this premise, and on the basis of 
.. available mutation rates, and x-ray-induced 
mutations at specific loci, Muller, in 1967 
(154), j calculated that .the mammalian ge- 
v ; nome would contain a maximum of not much • 
• v more than 30,000 genes (155). An estimate of 
, 30,000 gene loci for humans was also arrived ? 
; a at . by Crow and Kimura (755). ;Muller's esti-% 
v- mate for jD. melanogaster was 1 0,000 genes/: 
•..compared to 13,000 derived by: annotation of • 
: i the fly genome (26", \?7). :These arguments for 
; the theoretical maximum gene number were 
based on . 'simplified ideas of genetic load — 
that all \ genes . have a certain low rate of 
.mutation to a deleterious state. However, it is 
clear that many mouse, fly, worm, and yeast 
knockout mutations lead to almost no dis- 
cernible phenotypic perturbations. 
; / The ,, modest - number of human genes - : 
. means that we must look elsewhere' for the : 
x .mechanisms .that generate .the complexities 
..inherent in human development and the so- : t 
phisticated signaling systems that maintain 
homeostasis. .There are a large number of 
ways in which the functions of individual ; 
genes and gene products are regulated. The 
degree of "openness" of chromatin structure - 
v^and hence transcriptional activity is regulated > 
I by : protein complexes that involve • histone ■ 
and DNA enzymatic modifications. We enu- 
merate many of the proteins that are likely 
involved in nuclear regulation in Table 19. 
The location, timing, and quantity of tran- 
scription are intimately linked to nuclear sig- 
nal transduction events as well as by the 
tissue-specific expression of many of these 
proteins. Equally, important are regulatory 
DNA elements that include insulators, re- 
peats, and endogenous viruses (157); meth- 
ylation of CpG islands in imprinting (158); 
and promoter-enhancer and intronic regions 
that modulate transcription. The splicepsomal . 
machinery consists of multisubunit proteins 
(Table 19) as well as structural and catalytic 
RNA elements (159) that regulate transcript 
structure through alternative start and termi- 
nation sites and splicing. Hence, there is a 
need to study different classes of RNA mol- 
ecules (160) such as small nucleolar RNAs, 
antisense riboregulator RNA, RNA involved 
in X-dosage compensation, and other struc- 
tural RNAs to appreciate their precise role in 
regulating gene expression. The phenomenon 



of RNA editing in which coding changes 

V occur directly at the' level of mRNA is of 
; ^ clinical and biological relevance (161). Final- 

ly, ' examples of translational control include 
internal ribosomal entry sites that are found 
- : i : m proteins involved in cell cycle regulation 
and apoptosis ; (767);vAt the -protein ; level, 
, v mmor . alterations in the (nature of protein- 
protein ;> interactions, ? : protein vmodificatidns, 
* s -;|..and localization can have ! dramatic effects on 
cellular physiology (163):{Uds dynamic sys- 
• tern therefore has many, .ways = to modulate 
activity, which suggests .that 'definition of 
, complex systems by analysis of single genes 
v .is unlikely to be entirely successful. 
. .In situ studies have shown that the human 
genome . is / asymmetrically: ( populated i with • 
k ;;G+Ocontent,vCpG; islands, and genes (68). 
w.However^the' genes are not distributed quite 
f.; as unequallyVas.had been- predicted (Table 9) 
• ; (69). -The most' G+C-rich fraction of the ge- 
nome, H3 .;isochores,- constitute :more of the 
v. genome than previously thought (about 9%), 
, ; and are the most gene-dense , fraction, but 
. . contain only 25% of the genes, rather than the 
- predicted —40%. The low G+C L isochores 
. make up 65% of the genome, and 48% of the 
: ; genes. This inhomogeneity, the net result of 
millions of years of mammalian gene dupli- 
: cation, has . been described as; the "desertifi- \ 

V cation'* of the ; vertebrate; genome (71). Why 
vare, there .clustered regions of;high and low 
k gene density > and are . these accidents of his- 
l \ tory or driven by selection and evolution? If 

these deserts are dispensable, it ought to be 
. possible to find mammalian genomes that are 
far smaller in size than the human genome. 
Indeed, many , species of bats have genome 
r. sizes .that are much smaller than that of hu- 
l mans; for example, Miniopterus, a species of 
-Italian bat, has a genome size that is only 
50% that of humans (164). Similarly, Mun- 
tiacus, a species of Asian barking deer, has a 
genome size that is —70% that of humans. 

8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a 
nearly uniform ascertainment of polymorphism 
has been completed. Although we have identi- 
fied and mapped more than 3 million SNPs, this 
by no means implies that the task of finding and 
cataloging SNPs is complete. These represent 
only a fraction of the SNPs present in the 
human population as a whole. Nevertheless, 
this first glimpse at genome-wide variation has 
revealed strong inhomogeneities in the distribu- 
tion of SNPs across the genome. Polymorphism 
in DNA carries with it a snapshot of the past 
operation of population genetic forces, includ- 
ing mutation, migration, selection, and genetic 
drift. The availability of a dense array of SNPs 
will allow questions related to each of these 
factors to be addressed on a genome-wide basis. 
SNP studies can establish the range of haplo- 
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increase in the ability to mediate protein- 
protein interactions without dramatically in- 
creasing the absolute size of the protein com- 
plement (150). Evolution of apparently new ; 
. - (from the perspective of sequence analysis) 
: . protein, domains and increasing . regulatory : 
7: complexity by domain accretion both quanti- ' 
tatively and qualitatively (recruitment of nov- . 
el domains with preexisting ones) are two 
-features that we. observe in humans. Perhaps ''"!. 
the best illustration of this trend is the C2H2 
zmc finger-containing transcription factors;' 
. where we see expansion in the number of 
domains per protein, together, with verte- 
. brate-specific domains such as KRAB and 
SCAN. Recent reports on the prominent use 
of internal ribosomal entry sites in the human 
genome to regulate translation of specific 
classes of proteins suggests that this is an area 
that needs further research to identify the full ■ ' - 
extent of this process in the human genome 
(757). At the posttranslational level, although 
we provide examples of expansions of some - 
protein families involved in these modifica- 
tions, further experimental evidence is re- 
quired to evaluate, whether this is correlated 
with increased complexity in protein process- 
ing. Posttranscriptional processing and the 
extent of isoform generation in the human 
remain to be cataloged in their entirety. Given 
the conserved nature of the spliceosomal ma- 
chinery, further analysis will be required to 
dissect regulation at this level. 
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Panther family/subfamily* 



H 



•C2H2.zinc.finger-containinet 
COE 

CREB 

ETS-related 
.« forkhead-related 

L Fbs 

Groucho . . • \ - ; '- 
Histone HT ' 
Histone H2A 
Histone H2B 
Histone H3 .■ 
Histone H4 .■ 
Homeoticf 
ABD-B 
Bithoraxoid 
: Iroquois class 
Distal-less 
Engrailed 
.: LIM-contalning 
: MEIS/KNOX class 
NK-3/NK-2 class 
Paired box 
Six 

Leucine zipper 
Nuclear hormone receptorf 
Pou-related 
Runt-related 



} f Transcription factors/chromatm organization 



8 Conclusions 

8.1 The whole-genome sequencing 
approach versus BAC by BAC 

Experience in applying the whole-genome 
shotgun sequencing approach to a diverse 
group of organisms with a wide range of DCt ^ 
genome sizes Mid repeat content allows us to^ Cajp_a!n 
assess its strengths" and weaknesses. With the 
success of the method for a large number of 
microbial genomes, Drosophila, and now the 
• human, there can be no doubt concerning the 
utility of this method. The large number of 
microbial genomes that have been sequenced 
by this method (75, 80, 152) demonstrate that 
megabase-sized genomes can be sequenced 
efficiently without any input other that the de 
novo mate-paired sequences. With more 
complex genomes like those of Drosophila or 
human, map information, in the form of well- 
ordered markers, has been critical for long- 
range ordering of scaffolds. For joining scaf- 
folds mto chromosomes, the quality of the 
map (in terms of the order of the markers) is 
more important than the number of markers 
per se. Although this mapping could have 
been performed concurrently with sequenc- 
ing, the prior existence of mapping data was 
beneficial. During the sequencing of the A. 
thaliana genome, sequencing of individual 
BAC clones permitted extension of the se- 



Cadherin : 
Claudin 

Complement receptor-related 
Connexin 
Calectin 
Glypican 
ICAM 

Integrln alpha 
Integrin beta 
LDL receptor family 
Proteoglycans 

Bcl-2 



Calpain inhibitor 
Caspase 

ADAM/ADAMTS 
Fibronectin 
Globin 

Matrix metalloprotease 
Serum amyloid A 
Serum amyloid P (subfamily of 

Pentaxin) 
Serum paraoxonase/arylesterase 
Serum albumin 
Transglutaminase 



Cytochrome p450 
GAPDH 

Heparan sulfotransferase 
EF-1alpha 

Ribonucleoproteinsf 
Ribosomal proteins} 
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Apoptosis 
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0 


13 


7 


3 


Hemostasis 
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Other enzymes 
60. 89 
46 3 
11 4 
Splicing and translation 
56 13 
269 135 
812 111 
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8 
0 

13 
265 
256 



www3dencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 



1345 



. fclytic, activity, as a uracil DNA glycosylase * ; 
' r (140) and functions as a cell cycle regulator f 
v (14 J) and has even been implicated in apo- i- 
- ptosis (142), . 
Translation. Another striking set of hu-/ 
man expansions has occurred in certain fam-.. 
..ilies involved in the translational machinery! ; 
We identified 28 different ribosomai subunits y\ 
that each have at least 10 copies in the ge- , 
- nome; on average, for all ribosomai. proteins; ; -( 
.; there ; is about an 8-. to 10-fold expansion in ;i 
the number of genes relative to either the 
. worm or fly. Retrotransposed pseudogenes £ 

Table 19 (Continued) 



The human genome 

; may r account for many of these ..expansions 
•y: [see the discussipn aboye and (143)] / Recent 
^ evidence suggests that a number of ribosomai 
vrproteins have secondary functions indepen- 
t - dent, of . mek:mvplvement in protein biosyn- 
^.thesis; for example, L13a and the related L7 
; : ,$ubunits . (36 copies; in ^humans) have been 
> shown to induce appptosis (144), 
J: ; ;;^?^ere:is ^so. a : fourr ; to "fivefold expansion 
in the elongation , factor 1 -alpha family 
(eEFlA; 56 human genes) Many of these 
^expansions likely, represent intronless para- 
Jogs to presumably arisen from retro- 



: - Panther family/subfamily* 


. H 


■V - F ■ 


W 


Y : 


A 


MHC class! 


22 


0 


0 


0 .-. 


,0 


MHC class II 


20 


0 


0 


0 


0 


Other immunoglobulin! 


114 


0 


0 


0 


0 


Toll receptor-related 


10 


6 


0 


0 


0 


; • .-. .... Developmental and homeostatic regulaU 


ors 






Signaling moleculesf 












Calcitonin 


3 


0 


0 


0 


0 


Ephrin 


8 


2 


4 


0 


0 


FGF 


24 


1 


1 


0 


0 


Glucagon 


4 


0 


0 


0 


0 


Glvcoorotein hormone hpta rhain 


2 


0 


0 


0 


0 


Insulin 


1 


0 


0 


0 


0 


Insulin-like hormonp 


3 


• 0 


0 


0 


0 


Nerve prowth factor 


3 


0 


0 


0 


0 


Neurepulin/herfpulin 


6 


0 . 


0 


0 . 


0 


. neuropeptide Y 


. 4 


0 


0 ■ ; . 


0 


-• 0 


PDGF 


1 


1 


0 


: 0 


0 


Relaxin 


3 


0 


0 


o 


0 


Stannocalcin 


2 


0 


0 


0 


0 


Thymopoeitin 


2 


0 


1 


0 


0 


Thyomosin beta 


4 


2 


0 


0 


0 


TGF-p 


29 


6 


4 


0 


0 


VEGF . 


4 


0 


0 


0 


0 


Wnt 


- - - 18 " - 


6 


S' 


o 


6"-- 


Receptorsf 












Ephrin receptor 


12 


2 


1 


0 


0 


FGF receptor 


4 


4 


0 


0 


0 


Frizzled receptor 


12 


6 


c 
-> 


u 


u 


Parathyroid hormone receptor 


2 


0 


0 


0 


0 


VEGF receptor 


5 


0 


0 


0 


0 


BDNF/NT-3 nerve growth factor 


4 


0 


0 


0 


0 


receptor 












Dual-specificity protein phosphatase 


Kinases and phosphatases 








29 


8 


10 


4 


11 


S/T and dual-specificity protein 










kinasef 


395 


198 


315 


114 


1102 


S/T protein phosphatase 


15 


19 


51 


13 


29 


Y protein kinase! 


106 


47 


100 


5 


16 


Y protein phosphatase 


56 


22 


95 


5 


6 


ARF family 


. Signal transduction 








55 


29 


27 


12 


45 


Cyclic nucleotide phosphodiesterase 


'.'25 .. 


8 


6 


1 v."--' 


0 
1 


G protein-coupled receptors! J 
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35 
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^'.transposition/ arid again there is evidence that 
many of these may. be pseudogenes (145) 
.However, : a-: second form (eEFlA2) of this 
factor has been identied with tissue-specific 
. ^expression in skeletal muscle and a comple- 
.^mentary expression pattern to the ubiquitous- 
"> • ly expressed eEFIA (146)1 

results in multiple transcnpts ,from a jingle 
■j gene, and ; can therefore, generate additional 
diversity in an . organism's -protein comple- 
. ment We have identified 269 genes for ri- 
.;.;bonucleoproteins. ; This ^ represents over 2.5 
. : times the number of ribonucleoprotein genes 
••••• in the worm, two times that of the fly, and 
■~- - about the same^as .me, 265 /identified in.the 
f > - Arabidopsis - genome/ W diversity 
?i*S>f .ribonucleoprotein genes in humans con- 
tributes to gene regulation at either the splic- 
ing or translational level is unknown. 

Posttranslational modifications. In this 
set of processes, the most prominent expan- 
sion is the transglutaminases, calcium-depen- 
dent enzymes that catalyze the cross-linking 
of proteins in cellular processes such as he- 
mostasis and apoptosis (147). The vitamin 
K- dependent gamma carboxylase gene prod- 
uct acts on the GLA domain (missing in the 
fly and worm) found in coagulation factors, 
- osteocalcin, and matrix GLA. protein (148). 
Tyrosylprotein ; -sulfotransferases : participate . 
in me posttranslational modification of. pro-. 
•: teins. involved in inflammation and hemosta- 
. sis, including coagulation factors and chemo- 
kine receptors (149). Although there is no 
significant numerical increase in the counts 
for domains involved in nuclear protein mod- 
ification, there are a number of domain ar- 
rangements in the predicted human proteins 
that are not . found in the other currently se- 
quenced genomes. These include the tandem 
association of two histone deacetylase do- 
mains in HD6 with a ubiquitin finger domain, 
a feature lacking in the fly genome. An ad- 
ditional example is the co-occurrence of im- 
portant nuclear regulatory enzyme PARP 
(poly-ADP ribosyl transferase) domain fused 
to protein-interaction domains— BRCT and 
VWA in humans. 

Concluding remarks. There are several 
possible explanations for the differences in 
phenotypic complexity observed in humans 
when compared to the fly and worm. Some of 
these relate to the prominent differences in 
. the immune system, h em o stasis, neuronal, 
vascular, and cytoskeletal complexity. The 
finding that the human genome contains few- 
er genes than previously predicted might be 
compensated for by combinatorial diversity 
generated at the levels of protein architecture, 
transcriptional and translational control, post- 
translational modification of proteins, or 
posttranscriptional regulation. Extensive do- 
main shuffling to increase or alter combina- 
torial diversity can provide an exponential 
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significant expansion in two families of matrix 
metalloproteases: ADAM (a disintegrin and 
metalloprotease) and MMPs (matrix metallo- 
proteases). (Table 19). Proteolysis of extracel- 
lular, matrix (ECM) proteins js critical for tissue 
, development and for tissue! degradation in dis-^ 
eases such as cancer,' arthritis, 'Alzheimer's dis- . 
ease, and a variety of inflammatory conditions 
(135, 136). ADAMs are a family of integral . 
:.. membrane, proteins wim a pivotal.role'in fibrin- f. 
ogenolysis and modulating .interactions be- 
• tween hematopoietic . components and : the 
vascular matrix .components. These proteins > 
have been shown to cleave matrix proteins, 
and even signaling molecules: ADAM-17 
converts tumor necrosis factor-a, and 
ADAM- 10 has been implicated in the Notch 
signaling pathway (135). We have identified 
19 members of the matrix metalloprotease 
family, and a total of 51 members of the - 
ADAM and ADAM-TS families. 

Apoptosls. Evolutionary conservation of 
some , of the apoptotic pathway components 
across eukarya is consistent with its central 
role in developmental regulation and as a 
response to pathogens and stress signals. The 
signal transduction pathways involved in pro- 
grammed cell death, or apoptosis, are medi- 
ated by interactions between well-character- 
ized domains that include extracellular do- 
mains, adaptor (protein-protein interaction) 
domains, and those found in effector and 
regulatory enzymes (137). We enumerated 
the protein counts of central adaptor and ef- 
fector enzyme domains that are found only in 
the apoptotic pathways to provide an estimate 
of divergence across eukarya and relative 
expansion in the human genome when com- 
pared with the fly and worm (Table 18). 
Adaptor domains found in proteins restricted 
only to apoptotic regulation such as the DED 
domains are vertebrate-specific, whereas oth- . 
ers like BIR/CARD, and Bcl2 are represent- 
ed in the fly and worm (although the number 
of BcI2 family members in humans is signif- 
icantly expanded). Although plants and yeast 
lack the caspases, caspase-like molecules, 
namely the para- and meta-caspases, have 
been reported in these organisms (138). Com- 
pared with other animal genomes, the human 
genome shows an expansion in the adaptor 
and effector domain-containing proteins in- 
volved in apoptosis, as well as in the pro- 
teases involved in the cascade such as the 
caspase and calpain families. 

Expansions of other protein famines. 
Metabolic enzymes. There are fewer cyto- 
chrome P450 genes in humans than in either 
the fly or worm. Lipoxygenases (six in hu- 
mans), on the other hand, appear to be specific 
to the vertebrates and plants, whereas the lip- 
oxygenase-activating proteins (four in humans) 
may be vertebrate-specific. Lipoxygenases are 
involved in arachidonic acid metabolism, and 
they and their activators have been implicated 
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in diverse human pathology ranging from 
allergic responses to cancers. One of the most 
. ; surprising human expansions; however, is in, 
, thevnumber . of glyceraIdehyde-3-phosphate 
dehydrogenase (GAPDH) genes (46 in hu- 
mans, 1 3,in the fly; and 4 in the. worm). There 
.'. is; however, evidence/for many- retrotrans-,: 



posed GAPDH pseudogenes (139), which 
may account for this apparent expansion. 
However, it is interesting that GAPDH, long 
known as. a conserved enzyme involved in 
,/basic;metabolism found acrbss all phyla from 
. ; bacteria to humaps, has recently been shown 
) -to have other 'functions. It has a second cat- 



jnelanogaster^ C elegant S? cerevisiae (Y), and A thal,ana(A) ' 
! Rahm'ervfar^ly/subfamtly 1 * - \ 



H F ; 

T " Neural structure, function, development 

Ependymin . . 
Ion .channels . . 
Acetylcholine receptor 
Amiloride-sensitive/degenerin 
CNG/EAG 
IRK 

: ITP/ryanodine - 

Neurotransmitter-gated . 
P2X purinoceptor 
TASK 

Transient receptor 
Voltage-gated Ca 2+ alpha 
■ Voltage-gated Ca 2+ alpha-2 
Voltage-gated Ca 2+ beta 
Voltage-gated Ca 2+ gamma 
Voltage-gated K + alpha 
Voltage-gated KQT 
Voltage-gated Na+ . 
Myelin basic protein 
Myelin PO 
Myelin proteolipid 

Myelin-oligodendrocyte glycoprotein 
Neuropllin V 
Plexin • " 
; . Semaphorin " 
Synaptotagmin 

Defensin 
Cytokinef 
CCSF 
GMCSF 

Intercrine alpha 
- Intercrine beta~~- ~ - . - - 
Inteferon 
Interleukin 

Leukemia inhibitory factor 
MCSF 

Peptidoglycan recognition protein 
Pre-B cell enhancing factor 
Small inducible cytokine A 
SI cytokine . 
TNF 

Cytokine receptorf 
Bradykinin/C-C chemokine receptor 
Fl cytokine receptor 
Interferon receptor 
Interleukin receptor 
Leukocyte tyrosine kinase 

receptor 
MCSF receptor 
TNF receptor • 
Immunoglobulin receptorf 
f-cell receptor alpha chain 
T-cell receptor beta chain 
T-cell receptor gamma chain 
T-cell receptor delta chain 
Immunoglobulin FC receptor 
Killer cell receptor 
Polymeric-lmmunoglobulin receptor 



Y 



1 


0 


0 


0 


0 


1 7 


12 


* 56 


0 


0 


1 1 


24 


27 


0 


0 


Zc. 


9 


• 9 


0 


30 


1 c 
16 


3 


•: ■ 3 ... 


:.0 


0 


* ■» 10 " 


~ 2 ' 


4 


• ; o 


. 0 


61 


51 


. 59 


.-' . 0 


19 


10 


0 


0 


0 


0 


12 


12 


48 


,1 


5 


15 


3 


3 


.1 


0 


22 


, 4 


8 . . 


• 2 


2 


10 


: 3 , 


2 .. •• 


0 


0 


5 


2 


2 


0 


0 


1 


0 


0 


0 


0 


33 


5 


11 


0 


0 


6 


2 


3 


0 


0 


11 


4- 


4 


9. 


1 


1 


0 


0 


0 


0 


5 


0 


0 


0 


0 


3 


1 


0 


0 


0 


1 


0 


0. 


0 


0 


■' .2 ■ ■' 


0 


o 


o 


0 


;9. 


2 


0 


0 


6 


22 


6 


2 


0 


0 


10 


3 


3 


o 


0 


Immune response 








3 


0 


0 


0 


0 


86 


14 


1 


0 


o 


1 


0 


0 


0 


0 


1 


0 


0 


0 


0 


15 


0 


0 


0 


0 


... ,/ 5 ..-~.— 


— b * 


""0 ' 


0 


0 


8 


0 


0 


■« 0 


0 


26 


1 


1 


0 


0 


1 


0 


0 


0 


0 


1 


0 


0 


0 


0 


2 


13 


0 


0 


0 


1 


0 


0 


0 


0 


14 


0 




0 


0 


2 


0 


0 


0 


0 


9 


0 


0 


0 


0 


62 


1 . 


0 


0 


0 


7 


0 


0 


0 


0 


2 


0 


0 


0 


0 


3 


0 


0 


0 


0 


32 


0 


0 


0 


0 


3 


0 


0 


0 


0 


1 


0 


0 


0 


0 


3 


0 


0 


0 


0 


59 


0 


0 


0 


0 


16 


0 


0 


0 


0 


15 


0 


0 


0 


0 


1 


0 


0 


0 


0 


1 


0 


0 


0 


0 


8 


0 


0 . 


0 


0 


16 


0 


0 


0 


0 


4 


0 


0 


0 


0 
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Table 18 (Continued) 



Accession 
number 



Domain name 



•i Domain description 



H 



W 



PF02135 
PF01285 
PF02176 
PF00352 

PF00567 
PF00642 
PF00096 
PF00097 
PF00098 



Zf-TAZ 
TEA 

Zf-TRAF 

; TBP 

TUDOR 

Zf-CCCH 

Zf-C2H2** 

Zf-C3HC4 

Zf-CCHC 



TAZ finger 
TEA domain 
TRAF-type zinc finger 

•Transcription factor TFIID (or TATA-binding 

.protein; TBP) 
TUDOR domain 

Zinc finger. C-x8-C-x5-C-x3-H type (and similar) 

Zinc finger, C2H2 type . 

Zinc finger, C3HC4 type (RING finger) 

Zinc knuckle 



: 2(3) 

4 

6(9) 
2(4) 

-9(24) 
17(22) 
, ; . 564(4500) : 
: . 135 (137) 
9(17) 



, 1(2) 

.•• .1 

1(3) 

x MB) 

-;9(19); 
"6(8); 
234(771) 
57 
6(10) 



. V 6(7) 

-. . -1 - 

ovv2(4) 

4(5): ;: 
' ,22 {42) ; : 

68(155); 
88(89) 
17(33) 



0 

- 1 • 
:0 

^1(2) 
0 

-3(5) 
34(56) 
18 
7(13) 



10(15) 

• '-2(4) 
2 

31(46) 
:V21 (24) 
,298(304) 

68(91) 



. (Tables 18- and 19). They include secreted 
, . hormones and growth factors, receptors, in- 
tracellular signaling molecules,' and transcrip- 
tion factors. 

I Developmental signaling molecules that are , 
• enriched in the human genome include growth 
factors such as wnt, transforming growth fac- 
tor-0 (TGF-P), fibroblast growth factor (FGF), 
nerve growth factor, platelet derived growth 
factor (PDGF), and ephrins. These growth fac- 
. tors affect tissue differentiation and a wide 
range of cellular processes involving actin-cy- 
toskeletal and nuclear regulation. The corre- 
sponding receptors of these developmental li- 
gands are also expanded in humans. For exam- 
ple, our analysis suggests at least 8 human 
ephrin genes (2 in the fly, 4 in the worm) and 12 
ephrin receptors (2 in the fly, 1 in the worm). In 
the wnt signaling pathway, we find 18 wnt 
family genes (6 in the fly, 5 in the worm) and 
12 frizzled receptors (6 in the fly, 5 in the 
worm). The Groucho family of transcriptional 
corepressors downstream in the wnt pathway 
...are even more markedly expanded, with 13 
predicted members in humans (2 in the fly, 1 in ... 
the worm). 

Extracellular adhesion molecules involved 
in signaling are expanded in the human genome 
(Tables 18 and 19). The interactions of several 
of these adhesion domains with extracellular 
matrix proteoglycans play a critical role in host 
defense, morphogenesis, and tissue repair 
(757). Consistent with the well-defined role of 
heparan sulfate proteoglycans in modulating 
these interactions (752), we observe an expan- 
sion of the heparin sulfate sulfotransferases in 
the human genome relative to worm and fly. 
These sulfotransferases modulate tissue differ- 
entiation (755). A similar expansion in humans 
is noted in stnuctural proteins that constitute the 
actin-cytoskeletal architecture. Compared with 
the fly and worm, we observe an explosive 
expansion of the nebulin (35 domains per pro- • 
tein on average), aggrecan (12 domains per 
protein on average), and plectin (5 domains per 
protein on average) repeats in humans. These 
repeats are present in proteins involved in mod- 
ulating the actin-cytoskeleton with predominant 
expression in neuronal, muscle, and vascular 
tissues. 



, ; v ; ;. )i Comparison across the fiye sequenced eu- 
^ karyotic organisms revealed several expand- 
. ed protein families and domains involved in 
- cytoplasmic signal transduction (Table 18). 
; Tn , particular, .signal ; transduction "»pathways 
. .playing roles m developmental regulation and 
acquired immunity were substantially en- 
,/ , riched. - There is :. a factor of 2 or greater ex- 
pansion in humans in .the Ras superfamily 
.GTPases and the GTPase activator and GTP. 
exchange factors associated with them, . Al- 
though there are about the ,same number of 
tyrosine kinases in the human and C. elegans 
genomes, in humans there is an increase in 
the SH2, PTB, and ITAM domains involved 
A. in phosphotyrosine signal .transduction. Fur- 
ther, there is : a -twofold expansion' of .phos- 
phodiesterases in the human genome, com- 
pared with either the worm or fly genomes. : 
. The downstream effectors of the intracellu- . 
; lar signaling molecules include the transcription . 
factors that transduce developmental fates. Sig- 
nificant expansions are noted in the ligand- 
. binding nuclear hormone receptor class of tran- 
. .scription factors compared with the ^ly genome, ■> 
. .although not to the extent observed in the worm 
.' (Tables 18 and 19). Perhaps the most striking 
expansion in humans is in the C2H2 zinc finger 
transcription factors. Pfam detects a total of 
4500 C2H2 zinc finger domains in 564 human 
proteins, compared with 771 in 234 fly proteins. 
This means that there has been a dramatic 
expansion not . only in the number of C2H2 
transcription factors, but also in the number of 
these DNA-binding motifs per transcription 
factor (8 on average in humans, 3.3 on average 
in the fly, and 2.3 on average in the worm). 
Furthermore, many of these transcription fac- 
tors contain either the KRAB or SCAN, do- 
: mains, which are not found in the fly or worm 
genomes. These domains are involved in the 
oligomerization of transcription factors and in- 
crease the combinatorial partnering of these 
factors. In general, most of the transcription 
factor domains are shared between the three 
animal genomes, but the reassortment of these 
domains results in organism-specific transcrip- 
tion factor families. The domain combinations 
found in the human, fly, and worm include the 
BTB with C2H2 in the fly and humans, and 



>.home^orhams ;al6ne;'or;in combination with 
VPou and LIM : domains in all of the animal 
H r genomes.: m f plants,' : however, a different set of 
, j . transcription factors are expanded, namely, the 
-*;| : myb family, and a unique . set that includes VP1 
v / and - AP2 domain^ontaining proteins (754). 
/-The yeast genome has a paucity of transcription 
\ factors compared - with ?the multicellular eu- ' 

> karyotes, and its repertoire . is ;limited to , the 
... expansion of the yeast-specific C6 transcription 

- factor family' involved in metabolic regulation. 
While we have illustrated expansions in a 

. subset of signal transduction molecules in the 
human genome compared with the other eu- 
karyotic .genomes, it should be noted that 

> most of the protein domains are highly con- 
served. An interesting , observation is that 

- -.worms and humans .have .approximately the 
• . .same: number ;bf both tyrosine kinases and 

/serine/threonine I 'kinases (Table 19). It is im- 
portant to note> however, that these are mere- 
ly counts of the catalytic domain; the proteins 
. . that contain these domains also display a 
Awide. repertoire .of interaction domains with" 
. significant combinatorial diversity. 
^^-^^Hemostasi&^Hemqstasis' is regulated pri- 
marily by plasma proteases of the coagulation 
pathway and by the interactions that occur be- 
tween the vascular endothelium and platelets. 
Consistent with known anatomical and physio- 
logical differences between vertebrates and in- 
vertebrates, extracellular adhesion domains that 
constitute proteins integral to hemostasis are 
expanded in the human relative to the fly and 
worm (Tables 18 and 19). We note the evolu- 
tion of domains such as FIMAC, FN1, FN2, 
and Clq that mediate surface interactions be- 
tween hematopoeitic cells and the vascular ma- 
trix. In addition, there has been extensive re- . 
cnritment of moie-ancient animal-specific do- 
mains such as VWA, VWC, VWD, kringle, 
and FN3 into multidomain proteins that are 
involved in hemostatic regulation. Although we 
do not find a large expansion in the total num- 
ber of serine proteases, this enzymatic domain 
has been specifically recruited into several of 
these multidomain proteins for proteolytic reg- 
ulation in the vascular compartment These are 
represented in plasma proteins that belong to 
the kinin and complement pathways. There is a 
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myelin proteins result in severe demyelina- 
tion, which is a pathological condition in 
which the myelin is lost and the nerve con- 
duction is severely, impaired (130). Humans 
. have . : at least lO.genes. belonging to four; 
different families involved in myelin produc- ' 

Table 18 {Continued) 
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tion (five myelin PO, three myelin proteolip- 
id, myelin basic protein, and myelin-oligo- 
dendrocyte glycoprotein, or MOG), and pos-; 

r sibly^more-rembtely.related members -of the'; 

:MOG iamily^jFlies haye only a single myelin I 
proteolipid,: and worms have none at all 



Intercellular and intracellular signaling 
pathways in development and homeostasis. 
; Many protein. families that have expanded jn 
humans relative to the invertebrates are in- 
t volved in signaling processes, particularly in 
; response Jo development and differentiation 



Accession 
: mimber. 

PF00254 
PF01590 
PF01344 
PF00560 
PF00917 
PF00989 
PF00595 
PF00169 
PF01535 
. PF00536 
v PF01369 
PF00017 . 
PF00018 
PF01740 
PF0OS15 
PF00400 
PF00397 
PF00569 

PF01754 
PF01388 
PF01426 
PF00643 
PF00533 . 
PF00439 
PF00651 
PF00145 
PF00385 

PF00125 
PF00134 
PF00270 
PF01529 
PF00646 
PFOO250 
PFOO320^ 
PF01585 
PF00010 
PF00850 
PF00046 
PF01833 
PF02373 
PF02375 
PF00013 
PF01352 
PF00104 



PF00412 
PF00917 
PF00249 
PF02344 
PF01753 
PF00628 
PF00157 
PF02257 
PF00076 

PF02037 
PF00622 
PF01852 
PF00907 



Domain name 

FKBP 
CAF - 
Ketch 
LRR** 
MATH 
PAS 
PD2 
PH 

PPR** 

SAM : 
Sec7 
. SH2 
. SH3 
STAS 
TPR** 
WD40** 
WW 
ZZ 

Zf-A20 
ARID . 
BAH 

Zf-BJ>ox** 
BRCT : . . 
Bromodomain 
BTB 

DNA_methylase 
Chrorrio 

Histone 
Cyclin 
DEAD 
Zf-DHHC 
F-box** 
Fore head 
GAT^'^ ~~~~ 
G-patch 
HLH** 

Hist_deacetyl 
Homeobox 
TIC 
JmjC 
JmjN 

KH-domain 
KRAB 

Hormone_rec 



LIM 
MATH 

Myb_DNA-binding 

Myc-U 

Zf-MYND 

PHD 

Pou 

RFXLDNAJ>inding 
Rrm 

SAP 
SPRY 
START 
T-box 



:• V; Domain description. 

FKBP-type peptidyl-prolyl cis-trans isomerases 
CAF domain 
Kelch motif 
..Leucine Rich Repeat 
MATH domain " 
PAS domain 

PDZ domain (Also known as DHR or GLGF) 
PH domain • .' 

PPR repeat 

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAS domain 
TPR domain 
WD40 domain 
WW domain 

ZZ-Zinc finger present in dystrophin, CBP/p300 

Nuclear interaction domains 

A20-like zinc finger 
ARID DNA binding domain 
BAH domain 
B-box a'nc finger 

BRCA1 C Terminus (BRCT) domain 
Bromodomain - ' . 
BTB/POZ domain . 
C-5 cytosine-specific DNA methylase 
chromo* (CHRromatin Organization Modifier) 
domain 

. Core histone H2A/H2B/H3/H4 
Cydin 

DEAD/DEAH box helicase 
DHHC zinc finger domain 
F-box domain 

5 fork head dorham ^ n _ 

CATA zinc finger 
G-patch domain 

Helix-loop-heUx DNA-binding domain 
Histone deacetylase family 
Homeobox domain . 
IPT/TIC domain 
JmjC domain 
JmjN domain 
KH domain 
KRAB box 

Ugand-binding domain of nuclear hormone 

receptor 
LIM domain containing proteins 
MATH domain 

Myb-like DNA-binding domain 
Myc leucine zipper domain 
MYND finger 
PHD-finger * 

Pou domain— N-terminal to homeobox domain 
RFX DNA-binding domain 
RNA recognition motif (a.k.a. RRM, RBD. or RNP 

domain) 
SAP domain 
SPRY domain 
START domain 
T-box 



H 

15(20) 
7(8) 
54(157) 
25(30) 
11 

18(19) 
96(154) 
193(212) 
5 

: 29(31) 
13 

87(95) 
143 (182) 
5 

72(131) 
136 (305 
32(53 
10(11) 

2(8) 
11 
8(10) 
32(35) 
17(28) 
37(48) 
97(98) 
3(4) 
24(27) 



■ 7(8) 
2(4) 
12(48) 
24(30) 
5 

9(10) 
60(87) 
72(78) 
,3(4) 
15 
5 

. 33(39) 
55(75) 
1 

39(101) 
98(226) 
24(39) 
13 



W 

703) 
1 

13(41) 
7(11) 
88 (161) 
6 

46(66) 
65(68) 
0 
8 
5 

44(48) 
46 ( 61) 
6 

28(54) 



4 
0 

•-. • 3 
1 
1 
1 
2 
24 
1 

: 3 

5 
1 

23(27) 
2 

16(31) 



72(153) • 56 (121) 
16(24) 5(8) 
10 2 



2 
6 

7(8) 
1 

.10(18) 
16(22). 
62(64) 
■ 1 
14(15) 



2 
4 

4(5) 
2 

23(35) 
18(26) 
86(91) 
0 

17(18) 



0 
2 
5 
0 

10(16) 
10(15) 

■ 1(2) 
0 

1(2) 



75(81) 


5 


71 (73) 


8 


19 


10 


10 


11 


63 (66) 


48(50) 


55(57) 


50(52) 


15 


20 


16 


7 


16 


' 15 


309(324) 


9 


35 (36) 


-2Q(2lV 


.... .15.., 


: 4^ 


11(17) 


5(6) 


8(10) 


9 


: 18 


16 


13 


4 


60(61) 


r 44 


24 


4 


12 


5(6) 


8(10) 


5 


160 (178) 


100(103) 


82(84) 


6 


29(53) 


11(13) 


5(7) 


2 


10 


4 


6 


4 


7 


4 


2 


3 


28(67) 


14(32) 


17(46) 


4(14) 


204(243) 


0 


0 


0 


47 


17 


142(147) 


0 


62 (129) 


33(83) 


33(79) 


4(7) 


11 


5 


88(161) 


1 


32(43) 


18(24) 


17(24) 


15(20) 


1 


0 


0 


0 


14 


14 


9 


1 


68(86) 


40(53) 


32(44) 


14(15) 


15 


5 


4 


0 


7 


2 


1 


1 


224(324) 


127(199) 


94(145) 


43(73) 


15 


8 


5 


5 


44(51) 


10(12) 


5(7) 


3 


10 


2 


6 


0 


17(19) 


8 


22 


0 



24(29) 
10 

102(178) 
15(16) 
61 (74) 
13(18) 
5 
23 

474(2485) 
6 
9 
3 
4 
13 

65(124) 
167(344) 
11(15) 
10 

8 
7 

21(25) 
0 

12(16) 
28 
30(31) 
13(15) 
12 

48 
35 
84(87) 
22 

165(167) 
. . 0 

26 ' 
14(15) 
39 
10 
66 
1 
7 
7 

27(61) 
0 
0 

10(16) 
61 (74) 
243 (401) 
0 
7 

96 (105) 
0 
0 

232 (369) 

6(7) 
6 
23 
0 
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Accession 
-V. number . 


Domain name 


■.' PF00620 


RhoGAP 


PF00621 


RhoGEF 


PF00536 


SAM 


PF01369 


Sec7 


, PF00017 


SH2 


: PF00018 


SH3 


PF01017 


STAT 


PF00790 


VHS 


-r.. PF00568 


WH1 


PF00452 


Bcl-2 


PF02180 


=• BH4 


PF00619 


CARD 


PF00531 


Death 


. PF01335 


DED 


PF02179 


BAG 


PF00656 


ICE_p20 


PF00653 


BiR 


PF00022 


Actin 


DCAA1Q1 


AnncXin 


PF00402 


Catponin 


PF00373 


Band_41 


PF00880 

r i vvuuv 


Nebutin reoeat 


. PF00681 


Plectin__repeat 


PF00435 


Spectrin 


PF00418 


Tubulin-binding 


PF00992 


Troponin 


PF02209 


VHP 


PF01044 


Vinculin 


PF01391 


Collagen 


PF01413 


C4 


PF00431 


CUB 


PF00008 


EGF 


PF00147 


Fibrinogen_C 


PF00041 


Fn3 


PF00757 


Furin-like 


PF00357 


lntegrin_A 


PF00362 


... Integrin.B 


PF00052 


Laminin.B 


PF00053 


Laminin_EGF 


PF00054 


Laminin_G 


PF00055 


Laminin_Nterm 


PF00059 


Lectin_c 


rrU I*t03 


i Dorr 


PF01462 


LRRNT 


PF00057 . 


LdLrecept_a 


PF00058 


LdLrecept_b 


PF00530 


SRCR 


PF00084 


Sushi 


PF00090 


TspJI 


PF00092 


Vwa 


PF00093 


Vwc 


PF00094 


Vwd 


PF00244 • 


14-3-3 


PF00023 


Ank 


PF00514 


Armadillo_seg 


PF00168 


C2 


PF00027 


cNMPJ>inding 


PF01556 


DnaJ_C 


PF00226 


DnaJ 


PF00036 


Efhand** 


PF00611 


FCH 


PF01846 


FF 


PF00498 


FHA 



. Domain description 



H 



W 



RhoGAP domain 59 

RhoGEF domain ,46 

SAM domain (Sterile alpha motif) : / 29 (31) 

Sec7 domain 13 

Src homology 2 (SH2) domain . 87 (95) 
Src homology 3 (SH3) domain " '143(182) 

STAT protein 7 

VHS domain 4 

WH1 domain 7 

. Domains involved in apoptosis 

Bd-2 "9 

Bd : 2 homology region 4 ' .3 

Caspase recruitment domain 16 

Death domain _ "16. 

Death effector domain 4 (5) 

Domain present in Hsp70 regulators 5 (8) 

ICE-like protease (caspase) p20 domain 11 

Inhibitor of Apoptosis domain 8(14) 

Cytoskeletal 

Actin . " 61(64) 

Annexin ^ : 16(55) 

Calponin family 13 (22) 

FERM domain (Band 4.1 family) 29 (30) 

Nebulin repeat . 4(148) 

Plectin repeat 2(11) 

Spectrin repeat 31(195) 

Tau and MAP proteins, tubulin-binding 4(12) 

Troponin 4 

Villin headpiece domain 5 

Vinculin family .4 

' ••; :> ECM adhesion . . 

Collagen triple helix repeat (20 copies) 65(279) 

C-terminal tandem repeated domain in type 4 6 (11) 
procollagen 

CUB domain 47 (69) 

EGF-like domain 108(420) 

Fibrinogen beta and gamma chains, C-terminal 26 
globular domain 

Fibronectin type ill domain 106(545) 

Furin-like cysteine rich region 5 

Integrin alpha cytoplasmic region 3 

Integrins, beta chain 8 

Uminin B (Domain IV) 8(12) 

Laminin EGF-like (Domains lll and V) 24 (126) 

Uminin G domain 30 (57) 

Laminin N-terminal (Domain VI) 10 

Lectin C-type domain 47 (76) 

Leucine rich repeat C-terminal domain 69 (81) 

Leucine rich repeat N-terminal domain 40 (44) 

Low-density lipoprotein receptor domain class A 35 (127) 

Low-density lipoprotein receptor repeat class B 15 (96) 

Scavenger receptor cysteine-rich domain 1 1 (46) 

Sushi domain (SCR repeat) 53 (191) 

Thrombospondin type 1 domain 41 (66) 

von Willebrand factor type A domain 34 (58) 

von Willebrand factor type C domain 19 (28) 

von Willebrand factor type D domain 15 (35) 
Protein interaction domains 



14-3-3 proteins 
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DnaJ C terminal region 


12 


DnaJ domain 


44 
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83 (151) 


Fes/CIP4 homology domain 
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FF domain 


4(11) 


FHA domain 
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0 
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. '. • "5 
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15(16) 
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Table 18 (Continued) 



Accession 
number 



. Domain name 



. Domain description 



W 



PF00594 7 ,Cla 



PF00711 
PF00748 
PF00666 
PF00129 

. PF00993 
PF00969 
PF00879 
PF01109 
PF00047 
PF00143 
PF00714 
PF0O726 
PF02372 
PF0O715 
PF0O727 
PF02025 
PF01415 
PF00340 
PF02394 
PF02059 
PF00489 
PF01291 

PF00323 
PF01091 
PF00277 
PF00048 

PF01582 
. PF00229 
PF00088 

PF00779 
PF00168 
PF00609 
PF00781 
PF00610 

PF01363 

PF00503 
PF00631 
PF00616 
PF00618 

PF00625 
PF02189 
PF00169 
PF00130 

PF00388 

PF00387 

PF00640 
PF02192 
PF00794 
PF01412 
PF02196 
PF02145 
PF00788 
PF00071 
PF00617 
PF00615 
PF02197 



Defensin_beta 
Calpainjnhib 
: Cathelicidins 
MHCJ 

MHCJLalpha** 
. MHCJLbeta** 

Defensin_propep 
'. GM_CSF 

. 

Interferon 
IFN-gamma 
IL10 
IL15 
IL2 
IL4 
IL5 
IL7 
IL1 

IL1_propep 
IL3 
IL6 

UFJDSM 

Defensins 
PTN_MK 
SAA^roteins 
IL8 

TIR 
TNF 
Trefoil 

BTK 
C2 

DACKa 
DAGKc 
DEP 

_ FYVE 

G-alpha . 
G-gamma 
RasGAP 
RasGEFN 

Guanylatejdn 
ITAM . 
PH 

DAC_PE-bind 
PI-PLC-X 
Pi-PLC-Y 
PID 

PI3ie P 85B 
PI3ICrbd 
ArfGAP 
R8D 

Rap_GAP 
RA 
Ras 

RasGEF 
RGS 
Rlla 



,. Vitamin K-dependent carboxylation/gamma- 
. .- /carboxyglutamic (GLA) domain 

: Immune response 

Beta defensin 

Calpain inhibitor repeat ■ 

Catheliddins 

.. Class I histocompatibility antigen^, domains alpha i 

and 2 . ■ • .. . . .. .. 

Class II histocompatibility antigen, alpha domain 
• ; Class II histocompatibility antigen, beta domain 
Defensin propeptide 

Granulocyte-macrophage colony-stimulating factor 

Immunoglobulin domain 

Interferon alpha/beta domain 

Interferon gamma 

lnterleukin-10 

lnterleukin-15 " 

lnterleukin-2 ' 

lnterleukin-4 

lnterleukin-5 

lnterleukin-7/9 family 

lnterleukin-1 

lnterleukin-1 propeptide 

lnterleukin-3 

lnterleukin-6/G-CSF/MGF family 

Leukemia inhibitory factor (LIF)/oncostatin (OSM) 

family 
Mammalian defensin 

PTN/MK heparin-binding protein . . 

Serum amyloid A protein 

Small cytokines (intecrine/chemokine), 

interleukin-8 like 
TIR domain 

TNF (tumor necrosis factor) family -. 
Trefoil (P-type) domain 

. Pl-PY-rho CTPase signaling 

BTK motif 
C2 domain 

Diacylglycerol kinase accessory domain (presumed) 
Diacylglycerol kinase catalytic domain (presumed) 
Domain found In Dishevelled, Egl-10, and 
Pleckstrin (DEP) 
- F-^Y? ?'Q? finger 
GDP dissociation "inhibitor 
G-protein alpha subunit 
G-protein gamma like domains 
GTPase-activator protein for Ras-like GTPase 
Guanine nucleotide exchange factor for Ras-like 

GTPases; N-terminal motif 
Guanylate kinase 

Immunoreceptor tyrosine-based activation motif 
PH domain 

Phorbol esters/dtacylglycerol binding domain (CI 
domain) 

Phosphatidylinositol-specific phospholipase C, X 
domain 

Phosphatidylinositol-spedfic phospholipase C, Y 
domain 

Phosphotyrosine Interaction domain (PTB/PID) 
PI3-kinase family, p85-binding domain 
PI3-kinase family, ras-binding domain 
Putative GTP-ase activating protein for Arf 
Raf-like Ras-binding domain 
Rap/ran-GAP 

Ras association (RalGDS/AF-6) domain 
Ras family 
RasGEF domain 

Regulator of G protein signaling domain 
Regulatory subunit of type II PKA R-subunit 
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Tatye 18. Domain-based comparative analysis of proteins, in H. sapiens (H), 
D. metanogaster (F), C etegans (\M), 5. cerev/s/ae (Y), and A. •thaliana (A).The y 
predicted protein set of each of the above eukaryotic organisms was analyzed < 
, with Pfam version 5.5 using E value cutoffs of 0.001; The number of proteins 
;. containing the specified Pfam domains as well as the total number of domains li 
(in parentheses) are shown in each column. Domains were categorized into 
cellular processes for presentation. Some domains (U./SH2) are listed in 



;:■ more than one cellular process.; Results of the Pfam.analysis may differ from 
^results obtained based/on human curation;pf -pVotein. families, owing to the 
3 limitations of large-scale, automaticrclassifications. Representative examples 
of domains with reduced, counts' owing to the stringent E value cutoff used for 
.:.this analysis are marked with a. double asterisk (**)/ Examples include short 
divergent and predominantly alpha-helical domains; and certain classes of 
cysteine-rich zinc finger proteins. / 



Accession 
number 



Domain name 



Domain description 



W 



• • : '^^Developmental and homeostatic 

z PF02039 Adrenomedullin Adrenomedullin 

PF00212 ANP . Atrial natriuretic peptide 

PF00028 Cadherin Cadherin domain 

PF00214 Calc.CGRPJAPP Calcitonin/CGRP/IAPP family 

PF01110 CNTF . . Ciliary neurotrophic factor 

PF01093 ^Clusterin . Clusterin 

• PF00029 Connexin Connexin 

PF00976 . ACTH_domain , Corticotropin ACTH domain 

PF00473 . ' CRF r ■ Corticotropin-releasing factor family 

PF00007 Cysjcnot Cystine-knot domain 

PF00778 DIX Dix domain 

PF00322 Endothelin Endothelin family 

PF00812 Ephrin Ephrin 

PF01404 EPhJbd Ephrin receptor ligand binding domain 

PF00167 FGF .Fibroblast growth factor 

PF01534 Frizzled . . Frizzled/Smoothened family membrane region 

PF00236 Hormone6 Glycoprotein hormones 

PF01153 Glypican Glypican 

PF01271 Granin Grainin (chromogranin or secretogranin) 

PF02058 Guanylin ... Guanylin precursor 

PF00049 Insulin Insulin/IGF/Relaxin family 

PF00219 IGFBP Insulin-like growth factor binding proteins 

PF02024 Leptin Leptin 

PF00193 Xlink LINK (hyaluron binding) 

PF00243 NGF . Nerve growth factor family 

PF02158 Neuregulin . Neuregulin family - 

PF00184 Hormone5 Neurohypophysial hormones . 

PF02070 NMU Neuromedin U 

PF00066 Notch Notch (DSL) domain 

PF00865 Osteopontin Osteopontin 

PF00159 Hormone3 Pancreatic hormone peptides 

PF01279 Parathyroid Parathyroid hormone family 

PF00123 Hormone2 Peptide hormone 

PF00341 PDGF Platelet-derived growth factor (PDGF)~" 

PF01403 Sema Sema domain 

PF01033 Somatomedin_B Somatomedin B domain 

PF00103 Hormone Somatotropin 

PF02208 Sorb Sorbin homologous domain 

PF02404 SCF Stem cell factor 

PF01034 Syndecan Syndecan domain 

PF00020 TNFR_c6 TNFR/NGFR cysteine-rich region 

PF0001 9 TGF-p Transforming growth factor p-like domain 

PF01099 Uteroglobin Uteroglobin family 

PF01160 Opipds^neuropep Vertebrate endogenous opioids neuropeptide 

PF00110 Wnt VVnt family of developmental signaling proteins 

Hemostasts 

PF01821 ANATO Anaphylotoxin-like domain 

PF00386 Clq CI q domain 

PF00200 Disintegrin Disintegrin 

PF00754 F5..F8_type_C F5/8 type C domain 

PF01410 COLFI Fibrillar collagen C-terminal domain 

PF00039 Fnl Fibronectin type \ domain / 

PF00040 Fn2 Fibronectin type II domain 

PF00051 Kringle Kringle domain 

PF01823 MACPF MAC/Perforin domain 

PF00354 Pentaxin Pentaxin family 

PF00277 SAA_proteins Serum amyloid A protein 

PF00084 Sushi Sushi domain (SCR repeat) 

PF02210 TSPN Thrombospondin N-terminaWike domains 

PF01108 Tissuejac Tissue factor 

PF00868 Transglutamin^N Transglutaminase family 

PF00927 Transglutamin.C Transglutaminase family 
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zymes are involved in intermediary metabo- 
lism. The only exception is the hydrolase 
category, which is not significantly overrep- 
resented in the shared protein set. Proteases 
. form the largest part of this category, and 
several large protease families have expanded 
in eabti of these three .organisms . after their 
divergence.. The category of select regulatory 
.molecules is also overrepresented in the con-; 
served set. The v maj>r . conserved families are. 
small guanosine , triphosphatases (GTPases) 
| (especially the Ras-related superfamily, in- 
cluduig ADP ; ribosylation factor) , and cell 
• cycle regulators (particularly the cullin fam- 
ily, cyclin C family, and several cell division 
protein kinases). The last two significantly 
overrepresented categories are. protein trans- 
port and trafficking, and chaperones. The , 
most conserved groups in these categories are 
, proteins involved in coated vesicle-mediated 
transport, and chaperones involved in protein > 
folding and heat-shock response [particularly . 
the DNAJ family, and heat-shock protein 
60 (HSP60), HSP70, and HSP90 - families]. ■ 
These observations provide only a conserva- '\ 
tive estimate of the ; protein families .m.the; i 
context of specific cellular processes that : 
were , likely derived from the last common 
ancestor of the human, fly, and worm. As 
stated before, this analysis does not provide a 
complete estimate of conservation across the 
three animal genomes, as paralogous dupli- 
cation makes the determination of true or- 
thologs difficult within the members of con- 
served protein families. 

7.3 Differences between the human 
genome and other sequenced 
eukaryotic genomes 

To explore the molecular building blocks of 
the vertebrate taxon, we have compared the 
humaiugenome witfe the other sequenced -* t 
eukaryotic genomes at three levels: molec- ■ 
ular functions, protein families, and protein 
domains. 

Molecular differences can be correlated 
with phenotypic differences to begin to reveal 
the developmental and cellular processes that 
are unique to the vertebrates. Tables 18 and 
19 display a comparison among all sequenced 
eukaryotic. genomes, over selected protein/ 
domain families (defined by sequence simi- 
larity, e.g., the serine-threonine protein ki- 
nases) and superfamilies (defined by shared i 
molecular function, which may include sev- - 
era! sequence-related families, e.g., the cyto- 
kines). In these tables we have focused on 
(super) families that are either very large or 
that differ significantly in humans compared 
with the other sequenced eukaryote genomes. 
We have found that the most prominent hu- 
man expansions are in proteins involved in (i) 
acquired immune functions; (ii) neural devel- 
opment, structure, and functions; (iii) inter- 
cellular and intracellular signaling pathways 



the Human Genome 

in development and homeostasis; (iv) nemo-, 
stasis; and (v) apoptosis. 

Acquired immunity. One of the most 
striking .differences .between ,the human .ge- 
rmome.,and:the Drosophila: or C. elegans ge- 
; j nome.: is the appearance, of genes involved in 
acquired immunity (Tables 18 and 19).-This, 
is expected,: because the acquired immune; 
. response is ^.defense system that prily.roccurs • 
in vertebrates We observe 22 class I and 22 
class II major^histocompatibihty .complex. 
(MHG) antigen genes and 1 14 other immu- 
> noglpbulin • genes .in k the ihuman ; genome. J In / 
addition, there are .59 . genes . in the .cognate . 
, immunoglobulin receptor family. At the do- 
mam level, mis is exemplified by an expan- 
, ;sion and recruitment of the. ancient immuno- . 
. ... globulin fold to constitute molecules such as - 
; ..MHC, .and of the integrin fold to form several ■ 
. ; of the -cell, adhesion molecules' that, mediate, 
.interactions between; immune effector cells/ 
i and . the j.extracellular. matrix. : Vertebrate-spe- . 
;-. cific proteins, include the paracrine immune ■ 
regulators family, of secreted 4-alpha helical 
bundle : proteins^ namely me cytokines , and ^ 
=; .chemokines.> Some of the cytoplasmic signal ;- 
, transduction components associated with cy- 
tokine, receptor signal ..transduction; are also 
features that are poorly represented in the fly • 
, and . worm.' These include . protein domains 
found in the signal transducer and activator of 
transcription (STATs), the suppressors of cy- 
tokine signaling (SOCS), and protein inhibi-v 
tors, of .activated STATs (PIAS). In contrast, > 
<-. many of the animal-specific protein domains', 
that play a role in innate immune response,- , 
such as the Toll receptors, do not appear to be 
significantly expanded in the human genome. 

Neural development, structure, and 
function. In the human genome, as compared 
with the worm and fly genomes, there is a 
-marked-increase-in;the number of members 
of protein families . that .are . involved in 
v neural. development. Examples include neu- '? 
ro trophic factors such as ependymin, nerve, 
growth factor, and signaling molecules 
such as semaphorins, as well as the number 
of proteins involved directly in neural 
structure and function such as myelin pro- 
teins, voltage-gated ion channels, and syn- 
aptic proteins such as synaptotagmin. 
These observations correlate well with the 
known phenotypic differences between the 
nervous systems of these taxa, notably (i) 
the increase in the number and connectivity 
of neurons; (ii) the increase in number of 
distinct neural cell types (as many as a 
thousand or more in human compared with 
a few hundred in fly and worm) (12 1); (iii) 
the increased length of individual axons; 
and (iv) the significant increase in glial cell 
number, especially the appearance of my- 
elinating glial cells, which are electrically 
inert supporting cells differentiated from 
the same stem cells as neurons. A number 
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of prominent protein expansions are in- 
volved in the processes of neural develop- 
; % . rnent;,Of the extracellular domains that me,- 
r 4 ^diate^cell ^adhesion, V the -cdnnexin domain- 
containing proteins (122) exist only in hu- 
, • rmans.These proteins^ which are not present 
\r. - in the Drosophila or p. . elegans. genomes, 
> appear to provide the constitutive subunits 
. of intercellular. tchannels.'arid the structural 
basis. for electrical coupling -Pathway find- 
v:/jhg by axons and neuronal network forma- 
•••• tion is mediated through a subset of ephnns 
v ; j and their cognate receptor .tyrosine kinases 
V' ; that .act as positional Jabels ; to.: establish 
/ topographical projections (725). The : prob- 
: able biological role for the semaphorins (22 
; in human compared with' 6 in the fly. and 2 
t in the worm) and their, receptors .(neuropi- 
)J lins and plexins) is that of axbn'al guidance 
v -molecules (124) .. Signaling molecules such 
- as neurotrophic factors -and some cytokines 
;>..;have. been shown to 'regulate neuronal cell 
survival, proliferation;'*and axon guidance 
.; (125): 'Notch receptors arid ligands play 
; : important/roles inVglial cell fate determinar 
. ition and gliogenesis (126), 

Other human expanded, gene families play 
key roles* directly, in neural "structure and 
. . function. One example; is synaptotagmin (ex- ' 
panded more than twofold in humans relative 
to the invertebrates), originally found to reg- 
.fulate synaptic transmission by. serving as a 
Ca 2+ sensor (or receptor) .during synaptic 
. vesicle fusion and release (127);Of interest is 
the increased co-occurrence fin humans of 
. PDZ and the SH3: domains in -<neuronal- 
•; specific adaptor molecules; examples include 
: proteins that likely modulate channel activity 
at synaptic junctions (128). . We also noted 
expansions in several ion-channel families 
(Table 19), including the EAG subfamily 
i (related to cyclic nucleotide gated channels), " 
;" the. ;voItajge-gafed..;caiciu^soa^um ^channel 
family,- the inward-rectifier potassium chan- 
nel family, and the. voltage-gated potassium 
channel, alpha subunit family. Voltage-gated 
sodium and potassium channels are involved 
in the generation of action potentials in neu- 
rons. Together with voltage-gated calcium 
channels, they also play a key role in cou- 
pling action potentials to neurotransmitter re- 
lease, in the development of neurites, and in 
short-term memory. The recent observation 
of a calcium-regulated association between 
sodium channels and synaptotagmin may 
have consequences for the establishment and 
regulation of neuronal excitability (129). 

Myelin basic protein and myelin-associat- 
ed glycoprotein are major classes of protein 
components in both the central and peripheral 
nervous system of vertebrates. Myelin PO is a 
major component of peripheral myelin, and 
myelin proteolipid and myelin oligodendro- 
cyte glycopotein are found in the central 
nervous system. Mutations in any of these 



www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 



1337 



The Human genome 

' <7.2 Evolutionary conservation of core . . (720, , we identified 
processes . ..-.each . pairwise .comparison (human-fly > and -vWith : unambiguous : one-to-one relationships 

Because of the .various . "model .organism" human-worm). The first case was a pair of . ^(Fig;\16). By -mese- criteria, there are 2758 
;. ■> genome-sequencing projects that have al- : v.* genes, , one from each organism, for which • - strict human-fly ! orthologs, 203 1 human- 
. ready been completed, reasonable compara- ■ : , there was no other close homolog in either ;. ; .;Worni .(1523 .m :cbmmbh between these sets), 
tive information is available for beginning the , J. organism. These are straightforwardly identi- ^-;;We .define the eyolutionarily conserved set as 
•- analysis of the evolution ; of the human ge- ; ;i fied as orthologous, because there are ; no . those 1 523 ■ human proteins . that have strict 
. / nome. The genomes of ' S. cerevisiae ("bak- ; ;. ;> additional members of me families that com-, v;c.6rthiologs. : in-.'bpmi^/;v^e/awpg^^ryMtf r & ; 
ers\ yeast") (US)" and ..two diverse mverte-^;:^ plicate! separating .orthologs --from .paralogs J- '-^.elegans.'-i ;-"V ''.V/' 

brates, C. elegans (a nematode worm) (119) a The :second case is. a family of genes wim y rvr ^ iThe d^ 
. and D. melanogaster (fly) (2d), as well as the . ; . more than one member in either or both of the . : ;. .conserved protein .set is shown in Fig. 16, 
.;; ..first plant genome, ^. f/ia/wwa, recentiy com-; • .organisms being compared. Chervitz et al /Comparison with ; Fig. ;.15 shows that, not 
. . . pleted (P2), provide a diverse background for . (720 ; deal with this case, by analyzing a .surprisingly, me. set. of conserved protein 
genome comparisons^ ^ 

: We enumerated the "strict orthologs" con- -/ ^ snips between all of the sequences \in both -\ . .the same way as the whole human protein set. 
served between human and fly, and between . y. : org^ 
; . ■ human and worm (Fig. ,16) to., address ,the • /mat were nearest. neighbors in me tree. If me> .; 15); there are several categories that are over- 
v question, What • are the core functions . that .v ; v; nearest-neighbor pairs, were from ' different ».;;• represented in the'conser^ed set by afactor of 
appear . to be common across the animals? . : organisms, those genes were presumed to be . ;V ; or more. The first category is nucleic acid 
The concept of orthology is important be- ^ orthologs. We note mat mese nearest neigh- : > ma- 
. cause if two genes are orthologs, they can be • v bors can often be confidently identified from . : chinery * (notably V DNA/RNA methyltrans- 
traced by descent to the common ancestor of .pairwise sequence comparison without hay- , / ferases, .DNA/RNA; polymerases, helicases, 
the two organisms (an "evplutionarily con-,'. . ing to examine a phylogenetic - tree (see leg-? ^ DNA - ligases, -DNA- ' andf RNA-processing 
served protein set"), and therefore are likely end to Fig. 16). If the nearest neighbors .are . factors, nucleases, and ribosomal proteins), 
to perform similar conserved functions in the \ v not from different organisms, there has been The. basic transcriptional and translational 
different organisms. It is critical in this anal- . , a paralogous expansion in one or both organ- - machinery is well known to have been con- 
ysis to separate orthologs (a gene that appears , isms after the speciatiori event (and/or a gene served over evolution, from bacteria through 
in two organisms by descent from a common loss by one organism). When this one-to-one to the most complex eukaryotes. Many ribo- 
ancestor) from paralogs (a gene that appears , correspondence is lost, defining an ortholog . nucleoproteins involved in RNA splicing also 
, in more than one copy in a given organism by • becomes ambiguous. For our initial compu- ; appear to be conserved . among the animals. ■ 
. a duplication event) because paralogs may Vto^ 

subsequently diverge in function. Following tein set, we could not answer, this question. for .yed .(transferases/; oxiddreductases, ligasesj 
the yeast-worm ortholog comparison in . . every predicted protein. Therefore, we con- vlyases,« and isomerases). Many of. these en- ; 

Fig. 16. Functions of putative 
orthologs across vertebrate 
and invertebrate genomes. 
Each slice lists the number and 
percentages (in parentheses) 
of "strict orthologs" between 
the human, fly, and worm ge- 
nomes involved in a given cat- 
egory of molecular function. 
"Strict orthologs" are defined 
here as bi-directional BLAST 
best hits (780) such that each 
orthologous pair (i) has a 
BLASTP P-value of <MCT 10 
(720), and (ii) has a more sig- 
nificant BLASTP score than 
any paralogs in either organ- 
ism, i.e. ( there has likely been 
no duplication subsequent to 
speciation that might make 
the orthology ambiguous. This 
measure is quite strict and is a 
lower bound on the number of 
orthologs. By these criteria, 
there are 2758 strict human- 
fly orthologs, and 2031 hu- 
man-worm orthologs (1523 in 
common between these sets). 
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transporter (44, 2.6%) 




transferase (70, 4.1%) 

synthase and synthetase (64, 3.7%) ' 

oxidorcductasc (64. 3.7%) 
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Both the gene predictions and functional 
assignments have been made by using com- 
putational tools, although the statistical 
models in Panther, Pfam, and SMART have 
. been. built, annotated, and reviewed by ex- 
V. P ert biologists. In the set of computationally 
• : r predicted genes, we expect both, false-positive 
. ' predictions (some of these may in fact be inac- 
' tive -pseudogenes) and false-negative predic- 
; ; tipns (some human genes will hot be computa- 
• , tionally predicted);. We also, expect enbfc in 
delimiting the boundaries of exons and genes. 
Similarly, in. the automatic functional assign- 
: ments, we also expect both false-positive and 
/ false-negative predictions. The functional as- 
signment protocol focuses on protein families 
that tend to be found across several organisms, 
or on families of known human genes. There- 
•;, fore, we do not assign a function to many genes 
. that are not in large families, even if the func- . 

ton is known. Unless otherwise specified, all 
, enumeration of the genes in any given family or 
functional category was taken from the set of - 
26,588 predicted proteins, which were assigned 
. functions by using statistical score cutoffs de- 
fined . for models . in , Panther, Pfam, . and 
SMART. 

For this initial examination of the pre- 
dicted human protein set, three broad ques- 
tions were asked: (i) What are the likely 
molecular functions of the predicted gene 
products, and how are these proteins cate- 
gorized with current classification meth- 
ods? (ii) What are. the core functions that, 
appear to be common across the animals? 
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(iii) How does the human protein comple- 
ment differ from that of other sequenced 
eukaryotes? 

7.1 Molecular functions of predicted;^;; 
human proteins 

Figure 15 shows an overview, of, the puta-.i 
tive molecular; functions. of ; the predicted , 
26,588 .human, proteins ,'fhat have :at , least 
,two lines of supporting evidence About : 
, ; 4 1 % (1 2,809) ; of -the; gene -products' \d6uldS 
not be. classified:;'from this initial,;. analysis-, 
• and . are . termed" .proteins - with unknown .; 
functions. Because our -automatic.- classifi-/- 
cation methods treat only relatively .large 
protein families, there are a number of 
.. "unclassified" sequences, .that do, in .fact, 
have a known or predicted function. For the 
. 60% of the protein set that =have automatic -." 
functional predictions, ; the specific protein : 
functions have been -.placed into- broad 
classes. We focus here on molecular .tunc- : . 
tion (rather than higher order cellular pro- 
cesses) in order to classify as many proteins . 
as possible. .These functional predictions 
are , based on .similarity ; to sequences of :■ 
known function. 

In our analysis of the 12,731 additional low- 
confidence predicted genes (those, with only " 
one piece of supporting ..evidence), only 636 
(5%) of these additional putative genes were 
assigned molecular functions by the automated 
methods. One-third . of these 636 . predicted 
genes represented endogenous retroviral. pro- v 
teins, further suggesting that the majority , of ^ 



these unknown-function genes are not real 
genes. Given that most of these additional 
.12,095 genes appear to be unique among the 
p genomes sequenced.to^tei'many^may simply 
: ...represent false-positive gene predictions. 
- 'le-v The most, common molecular functions are 
the transcription factors and those involved in 
; ; nucleic acid metabolism (nucleic acid enzyme). 
Other functions that are highly represented in 
the human genome are the receptors, kinases, 
^ and hydrolases.; Not js^ the •■' 

^•hydrolases are proteases, *There are -also many 
proteins, that ;are members of proto-oncogene 
families, as well as families of "select regula- 
tory molecules": (i) proteins involved in specif- 
ic steps of signal transduction such as hetero- 
• ; trimeric GTP-binding proteins (G proteins) and 
; . cell cycle regulators, and (ii) proteins that mod- 
ulate the activity of kinases, G proteins, and 



Table; 17. \ Distribution of SNPs ' im ; classes of 
genomic regions. • 



Genomic region ' 
class 


\ i .Size of; . 

region 
. . examined 

(Mb) 


; Celera-PFP 
SNP 
density 
. (SNP/Mb) 


Intergenic 


- 2185 


707 


Gene (intron + 


646 


917 


exon) 






Intron 


615 


921 


First intron . 


164 


'808 


Exon 


.31 


>';:.'\ 529 


: First exon 


• 10 


592 
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signaling molecule (376, 1.2%) 



reccptbr(l543,S.0%) 

kinase (868, 2.8%) 

select regulatory molecule (988, 32%) 

transferase (6 1 0,2.0%) 
synthase and synthetase (313, 1 .0%) 

«idorcductasc(63©\2.l%)^ / 
frasc(M7,0.4%K 
Iigase(56,02%) 
isomcrasc(163 ( 

hydrolase (1227. 4.0%) 





cytoskclctal structural protein (876, 2.8%) 
extracellular matrix (437, 1.4%) 

r immunogIobuIin_(264, 0.9%) . 

ion channel (406, 13%) 
motor (376. 12%) 

structural protein of muscle (296\ 1 .0%) . 
protooncogene (902, 19%) 
^select calcium binding protein (34. 0.1%) 

, intracellular transporter (350, 1.1%) 

'tansportcr(533,1.7%) 





^ GO categories 



molecular function unknown (1 2809, 4 1 ,7%) 



Fig. 15. Distribution 
of the molecular 
functions of 26,383 
human genes. Each 
r-..sligejists-the-numr- 
bers and percentages 
(in parentheses) of 
human gene functions 
assigned to a given 
category of molecular 
function. The outer cir- 
cle shows the assign- 
ment to molecular 
function categories in 
the Gene" Ontology 
(GO) (779), and the 
inner circle shows 
the assignment to 
Celera's Panther mo- 
lecular function cate- 
gories (776). 



Panther categories 
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somes, . and ; whether . this .heterogeneity is. 
greater .than expected by chance. If SNPs 
occur by random and independent mutations; 
then it would seem that there ought to be a 
Poisson distribution of numbers of SNPs in 
fragments of arbitrary constant size.The.ob- 
served dispersion in the distribution of SNPs 
in 100-kbp fragments was far greater than 
predicted from a Poisson distribution (Fig. 
14). However, this simplistic model ignores • 
the different recombination rates and popula- 
tion histories that exist in different regions of 

• the genome. Population genetics theory holds 

.that we can account for. this variation with a 
mathematical formulation called the neutral 
coalescent (109). Applying well-tested algo r 

vrithms for simulating the neutral coalescent 
with recombination (110),. and using anef- : 
fective population size of 10,000 and a per- 

. base recombination rate equal to the mutation 
rate (111), we generated a distribution of num- 
bers of SNPs by this model as well (112). The 

': observed distribution of SNPs has a much larg- 
er variance than either the Poisson model or the 
coalescent model, and the difference is highly 
significant This implies that there is significant 
variability across the genome in SNP density, 
an observation that begs an explanation. 

Several attributes of the DNA sequence 
may. affect the local density of SNPs, in- 
cluding the rate at which DNA. polymerase 
makes errors and the efficacy of mismatch 
repair. One key factor that is likely to be 
associated with SNP density is the G+C 
content, in part because methylated cy- 
tosines in CpG dinucleotides tend to under- 
go deamination to form thymine, account- 
ing for a nearly 10-fold increase in the 
mutation rate of CpGs over other dinucle- 



0.05 
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: ,otides; : We tallied the GC content and nu- ' 
• cleotide , diversities . in 100-kbp windows ? 
-.••across the entire genome and found that the ; 

■ correlation between them was positive (r = 
. . ; 0.21) and highly significant (Pl<. 0.0001), 

-but G+C -content accounted for .only a,; 
rt small part of the variation. •-. -. 

;6.5 SNPs by genomic class 

• >{£o: test ' ^homogeneity, of , SNP .^densities 
. across functional classes, we partitioned 
. vsites into intergenic (defined as ; >5 kbp 
; ;from any predicted transcription unit), 5'- 
: UTRi exonic , (missense and silent), in- 

• tronic, and 3VUTR: for ,10,239 known-; 
y.) genes, .derived from the NCBI:RefSeq da- : 
v tabase and all human genes predicted from 
^.;the Celera Otto annotation. In' coding: re-7 
i \gions, SNPs were categorized as either, si- 

lent,: for those that do not change amino 
acid sequence, or missense, for those that 
change the protein product. The ratio of 

- missense to silent coding SNPs hv Celera- 1 
PFP, TSC, and Kwok sets (1.12, 0.91, and 
0.78, respectively) shows a markedly re- 

■ duced frequency of missense variants com- 
pared with the neutral expectation, consis- 
tent with the elimination by natural selec- 
tion of a fraction of the deleterious amino 
acid. changes (772). These ratios are com- 

...parable to. the missense-to-silent ratios of 
0.88 and 1:17 found by Cargill et ah (101) 
and by Halushka et al. (102). . Similar re- 

- suits were observed in SNPs derived from 
Celera shotgun sequences (46). 

It is striking how small is the fraction of 
SNPs that lead to potentially dysfunctional 
alterations in proteins. In the 1.0,239 Ref- 
i\ Seq genes, missense SNPs were only about 




Number of SNPs / 100 kb 

Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes 
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution. 
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely 
accounted for by a coalescent model of regional history. 



: ;0.12,'*0;14,:rand .0:i7 0 /o of the total SNP 
• .counts i\in\ Celera-PFP, TSC, and Kwok 
x SNPs; respectively: -Nonconservative pro- 
rOtein changes constitute an even smaller frac- 
i .tion of missense SNPs (47, 41, and 40% in 
?' Celera-PFP, Kwok, and TSQ. Intergenic re- 
vgions have been 'virtually unstudied (113), and 
v: .we note that 75% of the; SNPs :we . identified 
/ were intergenic (Table 17). The SNP rate was 
{•highest in introns and lowest in exons: The SNP 
r: rate, was lower, in intergenic ^regions than in 
introns, providing one of the first discnrninators 
between these two classes of DNA. these SNP 
; > : rates.were confirmed in the Celera SNPs, which 
- also exhibited ^ lower rate, in exons than in 
: v intronSy and in extragenic regions than in in- 
trons (46). Many of these intergenic SNPs will 
^ provide. 5 ^valuable^infbrmation in the form of 
markers for linkage and association studies, and 
isome. fraction. is likely to have a regulatory 
■ function as well.' 

K7* An Overview of the Predicted 
( Protein-Coding Genes in the Human 
Genome 

■i Summary. : This section provides an initial 
computational analysis of the predicted 
protein set with the aim of cataloging 
prominent . differences and \ similarities 

: when the human genome is compared with 

> other fully ^sequenced eukaryotic genomes. 

: Over 40% of, the predicted protein set in 

• . humans cannot, be ascribed .a - molecular 
function by methods that assign proteins to 

, known families. A' protein domain-based 
analysis provides a detailed catalog of the 
prominent differences in the human ge- 
nome when .compared with the fly and 

i worm genomes. Prominent among these are 
domain expansions in proteins involved in 
developmental regulation and in cellular 
processes such as neuronal function, hemo- 
stasis, acquired immune response, and cy- 
toskeletal complexity. The final enumera- 
tion of protein families and details of pro- 
tein structure will rely on additional exper- 
imental work and comprehensive manual 
curation. 

A preliminary analysis of the predicted hu- 
man protein-coding genes was conducted. 
Two methods were used to analyze and clas- 
sify the molecular functions of 26,588 pre- 
dicted proteins mat represent 26,383 gene 
predictions with atleast twolines of evidence 
as described above. The first method was 
based on an analysis at the level of protein 
families, with both the publicly available 
Pfam database (114, 115) and Celera's Pan- 
ther Classification (CPC) (Fig. 15) (116). 
The second method was based on an analysis 
at the level of protein domains, with both the 
Pfam and SMART databases (115, 117). 

The results presented here are prelimi- 
nary and are-subject to several limitations. 
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fig. 13. Segmental duplica- 
tions - between chromo- 
y. somes In the human ge- 
nome. The 24 panels show 
the 1077 duplicated blocks 
of genes, containing 10310 
pairs of genes in total. Each 

- line represents a pair of ho- 
. mologous genes belonging 
. to a block; all blocks con- 

■ tain at least three genes 
on each of the chromo- 
somes where they appear. 
Each panel shows all the 

- duplications between a ' 
single chromosome and 
other chromosomes with 
shared blocks. The chro- 
mosome at the center of 
each panel is shown as a 
thick red line for emphasis. 
Other chromosomes are 
displayed from top to bot- 

. torn within each panel or- 
dered by chromosome 
number. The inset (bot- 
tom, center right) shows a 
close-up of one duplica- 
tion between chromo- 
somes 18 and 20, expand- 
ed to display the gene 
names of 12 of the 64 
gene pairs shown. 
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(101, 102). The filtering steps consisted of i 
moving variants where the quality score in the 
Celera consensus was less than 30 and where 
the density of variants was greater than 5 in 400 
bp. These filters resulted in shifting the transi- 
non-to-trar* version . ratio from I 1 .57 • i to 
1.89: 1/When applied to 2.3 Gbp,of alignments 5 . 
/ between . the : Celera and PFP consensus V 
quences, these filters resulted in identification 
of 2,104,820 putative SNPs. from a total of 
2,7.78,474 ; substitution differences: ' Overlaps : 
between this set of SNPs and those found by 
• ; other methods are described below. : ■ >1 '". 

6.2 Comparisons to public SNP 
databases , : 

^l d x!i° naI SNPs ' incIudi ng 2,536,021 from ' 
dbSNP (www.ncbi.nlm.nih.gov/SNP) and 
> 13,150 from HGMD (Human Gene Muta- 
■ tion , Database, from . the University of ' 
.Wales, UK), were mapped on the Celera con- 
• sensus sequence by . a .sequence similarity 
search with the program PowerBlast (/05) The 
, two largest data sets in dbSNP are the Kwok 
: and TSC sets, with 47% and 25% of the dbSNP 
records. Low-quality alignments with partial 
coverage of the dbSNP sequence , and align- - 
ments that had less than 98% sequence identity 
between the Celera sequence and the dbSNP 
flanking sequence were eliminated dbSNP se- 
quences mapping to. multiple locations on the 
Celera genome were discarded A total of 
.2,336,935 dbSNP . variants . were mapped to 
1,223,038 unique locations on the Celera se- 
quence, implying considerable redundancy in 
dbSNP, SNPs .in the JSC set . mapped to 
585,8 1 1 unique genomic locations, and SNPs in * 
the Kwok set mapped to 438,032 unique loca- 
tions. The combined unique SNPs counts used • 
in this analysis, including Celera-PFP TSC l - 
and Kwok, is 2,737,668. Table 15 shows that a 
substantial fraction of SNPs identified by one of . 
these methods was also found by another meth- 
od The veryliigh overlap (36.2%) between the 
Kwok and Celera-PFP SNPs may be due in part 
to the use by Kwok of sequences that went into 
the PFP assembly. The unusually low overlap 
(16.4%) between the Kwok and TSC sets is due 
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■S » ^£^1^ * f diti0D ; These data are not ™ dil > """able, so 

quences W^-valid^ 

samples is an expensive and Iaborio£ proci ' SSft'^ft 
.. so confirmation on.multiple data sets mayS ? :^o^f^ ^ :P? T^^ again ' 

computational analysis) n *• * ' ' 

,cies ,oi the six possible base changes m - and the probability ( ' • " ; P " ' 
each set of SNPs (TM P \e.\ p™,„„„ v. i ' .. H .. •"^•^ 1 



ates slightly from the .2:1, transition-to- \Celera-Pn>. comparison, F = 29 73 > < 

•0.0001). •/ ' ' ' 

:.; ; Mverage diversity for the autosomes es- 
timated from ..the Celera-PFP.; comparison 
'was 8.94 X10- 4 . Nucleotide diversity on 
i the X chromosome was 6.54 -X I o'~. 4 ,- The 
X is expected to be less variable than au-» 
tosomes, because for every four copies of 
.. autosomes in the population, there are only 
imree X chromosomes, and this smaller ef- 
. fective population size means that random 
, dr ift will < more .rapidly = remove/variation . • 

from the X (106). 
:•- Having , ascertained nucleotide yariation 
.genome-wide, it appears that previous esti- 
mates of nucleotide diversity in humans 
based on samples of genes were reasonably 
accurate (101, 102,106, 107).. Genome-wide, 
our estimate of : nucleotide diversity was 



, trans version ratio .observed, in the. other 
SNP sets. This result is not unexpected, 
. because , some fraction of the computation- 
ally identified SNPs in the Celera-PFP 
comparison may in fact be sequence errors. 
A 2:1 transition:transversion ratio for the 
bona Tide SNPs would be obtained if one 
assumed that .15% of the sequence differ-, 
ences in the Celera-PFP set were a result of 
(presumably random) sequence errors. . 

6.3 Estimation of nucleotide diversity " s 
from ascertained SNPs 

The number of SNPs identified : varied . 
widely across chromosomes. In order to 
normalize these values to the chromosome 
size and sequence coverage, we used it, the 
standard statistic for nucleotide . diversity 
(104). Nucleotide diversity is i-a measure, of 



Imp * l\ of SNPs from genome-wide 

SNP databases. Table entries are SNP counts for 
each pair of data sets. Numbers in parentheses are 
the fraction of overlap, calculated as the count of 
overlapping SNPs dWtded by the number of SNPs 
T n ^ e c22 aUer ° f the Abases compared. 
Icl , ^L C ^ UntS f0r the da t^ases are: Celera- 
PFP, 2,104.820; TSC, 585,811; and Kwok 438,032, 
Only unique SNPs in the TSC and Kwok data sets 
were included. 



probability that , a pair of chromosomes 
drawn from the population will differ at a 
nucleotide site. In order to calculate nucle- 
otide diversity for each chromosome, we 
need to know the number of nucleotide 
sites that were surveyed for variation, and 
in methods like reduced respresentation se- 
quencing, we need to know the sequence 
quality and the depth of coverage at each 



• densely resequenced 
8.00 X 10- 4 (108). 



• human" genes was 



6.4 Variation in nucleotide diversity 
across the human genome 

Such an apparently high degree of variabil- 
ity among chromosomes . in SNP density 
raises the question of whether there is het- 
erogeneity at a finer scale within chromo- 



TSC 



Kwok 



Celera-PFP 
TSC 



188,694 
(0.322) 



158,532 
(0362) 

72,024 
(0.164) 



Table 16. Summary of nucleotide changes in different SNP data 


sets. 






SNP data set 


A/G 
(%) 


C/T 
(%) 


A/C 
(%) 


A/T 
(%) 


C/G 
(%) 


T/G 
(%) 


Transition: 
transversion 


Celera-PFP 

Kwok* 

TSCf 


30.7 
33.7 
33.3 


30.7 
. 33.8 
33.4 


10.3 
8.5 
8.8 


8.6 
7.0 
7.3 


9.2 
8.6 
8.6 


10.3 
8.4 
8.6 


1.59:1 
2.07:1 
1.99:1 
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By this measure, the .duplication segment : 

■ spans nearly half of each chromosome's net j 

/.length. The most likely scenario. is that the 
whole span of this region was duplicated as a 
single very large block, followed by shuffling 
owing to smaller scale rearrangements. As 

; ; such, at least four subsequent rearrangements 
; would need to be invoked to explain ^ the :_ ; 

; relative insertions and inversions seen in the ,. 
:1 4 .duplicated segment interval. ;The 64 protein;. 
: • pairs in this alignment . occur among 217 pro- \ 

„... • tein assignments on chromosome 18, ; and v ; 

• , ; among 322 protein assignments on chromo-, 
. . ? , some 20, for a density of involved proteins of . 
, 20 to 30%. -This is consistent with an ancient) 

^ large-scale duplication followed by subse- ; 
quent gene loss on one or both chromosomes. : 
Loss of just one member of a gene pair 
, subsequent to the duplication would result in . 
afailure to score a gene, pair in the block; less ; 

- than 50% gene loss on the chromosomes ; 
-would . lead to the duplication density ob- 
served here. As' an independent verification . 
of the significance of the alignments detect- 
ed, it can be seen that a substantial number of ... 
the pairs of aligning proteins in this duplica- : 
tion, including some of those annotated (Fig. : ' 
13), are those populating small Lek complete 
clusters (see above). This indicates that they 
are members of very small families of para- 

: logs; their relative scarcity within the genome 

. ' validates the uniqueness and robust nature of 
. their alignments. 

Two additional qualitative features were ob- 
served among many of the large-scale duplica- 
tions. First, several proteins with disease asso- 
ciations, with OMM (Online Mendelian Inher- 
itance in Man) assignments, are members _of 
duplicated segments (see web table 2 on Sd-, 
ence Online at www.sciencemag.org/cgi/con- 
tent/fuliy291/5507/1304/DCl). We have also 
observed a few instances where paralogs on 
both duplicated segments are associated with 
similar disease conditions. Notable among 
these genes are proteins involved in hemostasis 
(coagulation factors) that are associated with 
bleeding disorders, transcriptional regulators 
like the homeobox proteins associated with de- 
velopmental disorders, and potassium channels 
associated with cardiovascular conduction ab- 
normalities. For each of these disease genes, 
closer study of the paralogous genes in the 
duplicated segment may reveal new insights 
into disease causation, with further investiga- 
tion needed to determine whether they might be 
involved in the same or similar genetic diseases. 
Second, although there is a conserved number 
of proteins and coding exons predicted for spe- 
cific large duplicated spans within the chromo- 
some 18 to 20 alignment, the genomic DNA of 
chromosome 18 in these specific spans is in 
some cases more than 10-fold longer than the 
corresponding chromosome 20 DNA This se- 
lective accretion of noncoding DNA (or con- 
versely, loss of noncoding DNA) on one of a 
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. pair of duplicated .chromosome regions was 
^observed in many , compared regions. Hypothe- 

• V ses . to explain which mechanisms foster these 
. processes must be tested. V 

■ ■.. ., \ . . Evaluation . of ; the alignment results . gives 
some perspective on dating of the. duplications?' 
•AAsnoted above/large T scaIe. ancient, segmental 

• -duplication intact test;exp^ 
> blocks detected by this, genome-wide analysis 

vkThe ^regions of human chromosomes involved 
I - m upon 
v. above (chromosomes 2 to 14, 2 to 12, and 18 to 

• 20) are each syntenic toa distinct mouse chro- 
l ; mpsomal , region- .The corresponding mouse 
. - • chromosomal regions are much more similar in 

sequence conservation, and even in order, to 
;. ; ; their .human synteny. partners than -the human 
. duplication regions^are to each other. Further- ■ 
.4 " the corresponding mouse chromosomal regions ; 
. each , bear a significant proportion of genes or- - 
r thologous to the human genes * on which the 
: ; human duplication assignments were made. On 
the basis, of these factors,; me. corresponding 
, mouse,;chromosomal spans, at coarse resolu- 
7. tion, appear to be products of the same large- 
r scale duplications observed . in humans. Al- . 
though further detailed analysis must be carried 
: out once a more complete genome is assembled 
for mouse, the underlying large duplications 
.,• appear to predate the two species': divergence. v 
' -This dates the duplications, at the latest, before : 
.. ••divergence of the primate and rodent lineages. 
, This date can be further refined upon examina- 
tion of the synteny between human chromo- 
somes and those of chicken, pufTerfish (Fugu 
rubripes), or zebrafish (95). The only sub- 
stantial syntenic stretches mapped in these 
species corresponding to both pairs of human 
.. duplications are restricted to the Hox cluster 
; regions.: When the synteny of .these regions/ 
(or others) to human chromosomes is extend- 
ed with further mapping, the ages of the 
nearly chromosome-length duplications seen 
in humans are likely to be dated to the root of 
vertebrate divergence. 

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplica- 
tions raises the question of whether an ancient 
whole-genome duplication event is the under- 
lying explanation for the numerous duplicated 
regions (96). The duplications have undergone 
many deletions and subsequent rearrangements; 
these events make it difficult to distinguish 
between a whole-genome duplication and mul- 
tiple smaller events. Further analysis, focused 
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be 
necessary to determine which of these two hy- 
potheses is more likely. Comparisons of ge- 
nomes of different vertebrates, and even cross- 
phyla genome comparisons, will allow for the 
deconvolution of duplications to eventually re- 



;V\veal the/stagewise history of our genome, and 
V: with it a history 6f the emergence of many of 
v ; : me ke/.M us from other 

. living things. 

':tf. r 6 A Genome-Wide Examination of 
>V -Sequence Variations 

.; ; methods were usc^ • 

to identify single-nucleotide polymorphism* 
f, >ii (SNPs) by'cbmparisdnbf the Celera sequence, 
to other, SNP^ resources. The SNP rate be- 
M tween two chromosomes was —1 per 1200 to 
1500 bp. SNPs: are distributed nonrandomly 
. / throughout the genome. Only a very small 
proportion ; of >11 :SNPs (<1%) potentially 
impact protein ^function based on the func* 
:^4ionah analysis of SNPs that affect the pre- 
via dieted 'Coding regions. 'This results in an cs* 
> itimate thai only thousands, not millions, of 
• / j genetic variations may contribute to the struc- 
v } rural diversity of human proteins. 

s \Vi Having a complete genome sequence cnahlci 
: : researchers to achieve a dramatic acccicnrtjpn 
• in the rate of gene discovery, but only through 
. analysis of sequence variation in DNA can wc 
discover the genetic basis for variation in health 
among human beings. Whole-genome shotgun 
./ sequencing is a particularly effective method 
. for detecting sequence variation in tandem with, 
/. whole-geripme assembly. In addition, we com- 
: ■ : pared > the ; distribution and attributes of SNPs 
.. • ascertained by three other methods: (i) aiign- 
.ment of the Celera consensus sequence to the 
PFP assembly, (ii) overlap of high-quality reads 
of genomic sequence (referred to as "Kwok"; 
1,120,195 SNPs) (P7), and (iii) reduced repre- 
sentation shotgun sequencing (reteed to as 
s ,'TSC^; 632,640 SNPs) (98). These data were 
V , consistent in showing an overall nucleotide di- 
versity of -8 X I 0~ 4 , marked heterogeneity 
across the genome in SNP density, and an 
oveiwhelming preponderance of noncoding 
variation that produces no change in expressed 
proteins. 

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full 
use of sequence depth and quality at every site, 
and quantitatively control the rate of false-pos- 
itive and false-negative calls with an explicit 
sampling model (99). Comparison of consensus 
sequences in the absence of these details neces- 
sitated a more ad hoc approach (quality scores 
could not readily be obtained for the PH * s 
sembly). First, all sequence differences betwee 
the two consensus sequences were identiti e . 
these were then filtered to reduce the contnn 
tion of sequencing errors and misassembiy. , * 
a measure of the effectiveness of the filtcni b 
step, we monitored the ratio of traiisition 
transversion substitutions, because a 2:1 
has been well documented as typical in rn. 
malian evolution (100) and in human n 
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termined to be in the same family and tl 
same complete Lek cluster (essentially 
paralogous genes) (89). Initially, each chro- 
mosome was represented as a string of genes 
ordered by ; the .start codons for predicted 
, genes along the chromosome! We considered 
the two ^ strands as a single string, because ; 
Mocal inversions are relatively common events 
relative to' large-scale duplications., Each 
gene was indexed according to the protein 
i. family, and JLek complete cluster (89), All ; 
pairs, of mdexed gene strings were then 
aligned in both the forward and reverse di- 
rections with the Smim-y/aterman algorithm , , 
(90). A match between two proteins of the 
same Lek complete cluster was given a score 
of 10 and a mismatch -10, with gap open 
and extend penalties of -4 and -1. With 
: . these parameters, 19. conserved interchromo- . ; 

. somal blocks of duplication were observed, 'y 
.. all of which were also detected and expanded > 
.. by the comprehensive method described be- 
\ low. The.detection of only a relatively, small : . 
number o£ block, duplications was a conse- \» 
. quence of using an intrinsically conservative 
/method grounded in the conservative con-, 
straints of the complete Lek clusters. .'" : ,> . ! 

In the second, more comprehensive ap- 
* proach, we aligned all chromosomes directly 
with one another using an algorithm based on 
the MUMmer system (91). This alignment 
method uses a suffix tree data structure and a 
linear-time algorithm to align long sequences 
very, rapidly; for example, two chromosomes 
of 100 Mbp can be aligned in less than 20... 
min (on a Compaq Alpha computer) with 4 \ 
gigabytes of memory. This . procedure . was , 
used recently to identify numerous large- 
scale segmental duplications among the five 
chromosomes of A. thaliana (92); in that 
organism, the method revealed that 60% of 
the genome (66 Mbp) is covered by 24 very 
_ large duplicate^ ^ sejments. -For Arabidopsis, a 
DNA-based alignment was sufficient to re- 
veal the segmental duplications between 
chromosomes; in the human genome, DNA 
alignments at the whole-chromosome level 
are insufficiently sensitive. Therefore, a mod- 
ified procedure was developed and applied, 
as follows. First, all 26,588 proteins 
(9,675,713 million ammo acids) were concat- 
enated end-to-end in order as they occur 
along each of the 24 chromosomes, irrespec- 
tive of strand location. The concatenated pro- 
tein set was then aligned against each chro- 
mosome by, < the MUMmer algorithm. The 
resulting matches were clustered to extract all 
sets of three or more protein matches that 
occur in close proximity on two different 
chromosomes (93); these represent the can- 
didate segmental duplications. A series of 
filters were developed and applied to remove 
likely false-positives from this set; for exam- 
ple, small blocks that were spread across 
many proteins were removed. To refine the 
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filtering methods, a shuffled protein set was 
first created by taking the 26,588 proteins, 
randomizing their order, and then partitioning 
them into .24 shuffled chromosomes, .each 
- containing the same number of proteins as the. 
.- true genome/This shuffled protein set has the 
y. identical composition to. the real genome; in . 
particular, every .protein and every domain 
appears the same number of times.YThe com- 
plete algorithm was then applied-to both the 
real and the shuffled data, with the results on 
the shuffled .data, being Used to estimate the ■ 
false positive rate, The algorithm after filter- ' 
i^'^ti^'iO^lf^gmjfm in 1077 dupli-y 
cated blocks containing 3522 distinct genes; . 
; tandemly duplicated expansions in many of . 
' the blocks explain the excess of gene pairs to 
distinct genes. In the shuffled, data, by con- 
% trast, only 370. gene pairs were, found, giving 
,a/alse : positive . .estimate of 3.6%:.;The most \ 
\ likely, explanation for . the 1 077 block dupli- / 
■cations is ancient segmental duplications. ,In 
many cases, the order of the proteins has been ; 
shuffled, although proximity is preserved. • 
Out /of . the .1077 blocks, .159 , contain only ; \ 
three, genes, 137 contain four genes, and 1%\ 
contain, five or more genes 
... .To. illustrate the extent of the detected 
duplications, Fig. 13 shows all 1077 block 
duplications indexed to each chromosome in 
,24 panels in which only duplications mapped 
to the indexed chromosome are displayed. 
The figure makes it clear that the duplications 
are; ubiquitous in the genome. One feature 
that it displays is many relatively small chro- 
mosomal stretches, with one-to-many dupli- 
cation relationships that are graphically strik- 
ing. One such example captured by the anal- 
ysis is the well-documented olfactory recep- 
tor (OR) family, which is scattered in blocks 
throughout the genome and which has been 
analyzed for genome-deployment reconstruc- 



.. ;jis at several evolutionary stages (94). The 
figure also illustrates that some chromo- 
somes, such as chromosome 2, contain many 
:more detected large-scale; duplications .than 
;:ptfaers:;inde^*-6ne-^of.tiidll^est duplicated 
Yj segments is a largevbloclc of 33 proteins on 
.^chromosome :2; spread among; 'eight smaller 
\ ' blocks in 2p, that aligns to a paralogous set on 
: - .i chromosome 14, with one rearrangement (see 
% chromosomes 2- and 44 panels* in Fig. 13). 
. . .The proteins are not contiguous but span a 
. region (.containing 97 proteins on chromo- 
some-2 and 332 proteins on chromosome 14. 
fv The likelihood of observing this many dupli- 
■ cated proteins by chance,- even over a span of 
this length, is 2.3 X 10" 68 (Pi). This dupli- 
... cated set spans 20 Mbp on chromosome 2 and 
; 63 Mbp on chromosome 14, over 70% of the 
i\. latter chromosome.- Chromosome 2 also con- 
,j tains i a block . duplication . that : is nearly as 
. large, which is shared ,by chromosome arm 2q ' 
■*. : and chromosome 12;sThis duplication incor- " 
r; porates - two: of the four .known Hox gene 
£ clusters, but considerably expands the extent 
* of the duplications proximally and distally on 
the pair of chromosome arms. This breadth of 
y duplication is also seen on the two chromo- 
v somes carrying the other two Hox clusters. 
.: An additional large duplication, between 
chromosomes 18 and 20, serves as a good 
example to illustrate some of the . features 
common to many of the other observed large 
duplications (Fig. . 13, mset):.This duplication 
contains 64 ■ detected, ordered intrachromo- 
somal pairs of homologous genes. After dis- 
v counting a 40-Mb stretch of chromosome 18 
.. free of matches to chromosome 20,- which is 
V likely to represent a large insert (between the 
. gene assignments "Krup rel" and "collagen 
rel"on chromosome 18 in Fig. 13), the full 
duplication segment covers 36 Mb on chro- 
mosome 18 and 28 Mb on chromosome 20. 



700 n 



I Human/Worm 
1 Human/Fly 




5:1 4:1 3:1 
human predominant 



1:1 
Ratio 



1:2 



1:3 1:4 1:5 
fly/worm predominant 



Fig. 12. Gene duplication in complete protein clusters. The predicted protein sets of human, worm, 
and fly were subjected to Lek clustering (27). The numbers of dusters with varying ratios (whole 
number) of human versus worm and human versus fly proteins per cluster were plotted. 
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that account for. gene mactivation. The gen-, 
eral structural characteristics of these pro- 
cessed pseudogenes . include . . the complete 
lack of intervening sequences found in the 
functional counterparts, a poly(A) tract at the 
3' end, and direct repeats flanking the pseu- 
dogene. sequence. Processed pseudogenes pc- : 
cur as a result of retrotransposition, whereas 
unprocessed pseudogenes arise from segmen- . 
tal genome duplication. 

We searched the complete set of Otto- 
-predicted transcripts against the. genomic se- 
quence by means of BLAST.: Genomic 're- 
gions corresponding . to all lOtto-predicted 
. transcripts were excluded from this analysis. 
We identified 2909 regions, matching with 
greater than 70% identity over at least 70% of 
the length of the transcripts that likely repre- ' 
sent processed pseudogenes. This number is 
probably an underestimate because specific 
methods to search for pseudogenes were not 
used. • 

We looked for ^correlations between 
structural elements and . the propensity for 
retrotransposition in the human genome. 
GC content and transcript length were com- 
pared between the genes with processed 
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- pseudogenes (1 1 77, source genes) - versus 
;: • ithe -remainder .of the predicted gene set. 
^Transcripts that give rise to processed pseu- 
. dogenes ■ have .* shorter . average transcript 
: length (1027 bp versus 1594 bp for the Otto 
set) as compared with genes for which no 
^Jpseudpgene was detected; The overall GC: 
content. did not show-any significant differ- 
-encej contrary to a recent report (88). There 
; is: a. .clear trend in ; ; gene families that are 
present as processed pseudogenes These 
include' ribosomal ; proteins \ (67%), lamin 
receptors (10%), translation elongation fac- 
tor alpha (5%), and HMG-non-histone pro- 
teins. (2%). The increased ^occurrence of 
;^firetrotransposition;0)pth intronless paralogs 
' and , processed pseudogenes) among genes 
'involved in translation and nuclear regula- 
tion may reflect an increased .transcription- . 
v • al. activity of these genes! . ".. 

,.. 5.3 Gene duplication in the human 
genome ' [: 

: Building on a previously published procedure 
(27), we developed a graph-theoretic algo- 
rithm, called Lek, for grouping the predicted 
human, protein set into protein families (89). 



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the 
whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG 
likelihood ratio of 2:0.6. Method 2 uses a CG likelihood ratio of 2=0.8. , ■■ 



Chromosome 22 



Whole genome 
(CS assembly) 



Method 1 


Method 2 


Method 1 


. Method 2 


Number of CpG islands 5,211 


522 


195,706 


26.876 


detected 








Average length of island (bp) "~ 390 


535 


395 


497 


Percent of sequence 5.9 


0.8 


2.6 


0.4 


predicted as CpG 




42 


22 


Percent of first exons that 44 


25 


overlap a CpG island 




40 


21 


Percent of first exons with 37 


22 


first position of exon 








contained inside a CpG 








island 








Average distance between 1,013 


10,486 


2,182 


17,021 


first exon and closest CpG 








island (bp) 






55,811 


Expected distance between 3,262 


32,567 


7,164 


first exon and closest CpG 








island (bp) 








Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence. 




Megabases in 


Percent 


Previously 


Repetitive elements 


assembled 


of 


predicted 


sequences 


assembly 


(%) (83) 


Alu 


288 


9.9 


10.0 


Mammalian interspersed repeat (MIR) 


66 


2.3 


1.7 


Medium reiteration (MER) 


50 


1.7 


1.6 


Long terminal repeat (LTR) 


155 


5.3 


5.6 


Long interspersed nucleotide element 


466 


16.1 


16.7 


(LINE) 








Total 


1025 


35.3 


35.6 



The complete clusters that result from the 
y Lek clustering provide one basis for compar- 
v iing the role of whole-genome or chromosom- 
':: al duplication in protein family expansion as 
opposed to other means, such as tandem du- 
plication. Etecause each, complete cluster rep- 
.. • resents .a closed and certain island of homol- 
\ ; Pgy, and because Lek is capable •■' of simulta- , 
; neously .clustering iprotein /complements of 
several organisms, the number ' : . of proteins 
, contributed by . each organism to a complete 
^cluster can_.be predicted with confidence de- 
. pending on the quality of the . annotation of 
each genome. The variance of each organ- 
; ism's contribution to. each cluster can then be 

• ; :xaiculated, allowing an assessment of the rel- 
K ative ; importance, of -large-scale Iduplication 

versus Vsmaller-scale,' prganism-specific ex- 
>;vpansion and. contraction of protein families, 
presumably; as a result of natural selection 
operating on individual protein families with-' 
in an organism. As can be seen in Fig. 12, the 
. large variance in the relative numbers of hu- 
. : man as compared with D. melanogaster and 
Caenorhabditis elegans proteins in complete 
:'■ clusters may be explained by multiple events 
of relative expansions in gene ■ families in 
each of the three animal genomes. Such ex- 
pansions would , give rise to the distribution 
that shows a. peak .at 1:1 in the ratio for 
■ human-worm or human-fly clusters with the 

• slope spread covering both, human, and fly/ 
worm, predominance, as iwe ^observed (Fig. 
12). Furthermore, there are nearly as many 
clusters where worm and fly proteins pre- 

. dominate despite the larger numbers of pro- 
teins in the human. At face value, this anal- 
ysis suggests that natural selection acting on 

, individual protein families has been a major 
force driving the expansion of at least some 
elements of the human protein set. However, 
in our analysis, the difference between an 
ancient whole-genome duplication followed 
by loss, versus piecemeal duplication, cannot 
be easily distinguished. In order to differen- 
tiate these scenarios, more extended analyses 
were performed. 

5.4 Large-scale duplications 

Using two independent methods, we 
searched for large-scale duplications in the 
human genome. First, we describe a protein 
family-based method that identified highly 
conserved blocks of duplication. We then 
describe our comprehensive method for identi- 
fying all interchromosomal block duplications. 
The latter method identified a large number of 
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 

The first of the methods is based on the 
idea of searching for blocks of highly con- 
served homologous proteins that occur in 
more than one location on the genome. For 
this comparison, two genes were considered 
equivalent if their protein products were de- 
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Otto-predicted, single-exon genes were suu 
jected to BLAST analysis against the proteins 
encoded by the remaining multiexon predict- 
ed transcripts. Using homology criteria of 
70% /sequence identity over 90% of the 
length, we identified 298 instances of single-, 
vto .multi-exon correspondence. Of these 298 
sequences, 97 .were represented in the Gen- 
Bank data set of experimentally validated 
full-length genes at the stringency specified ' 
and were verified by .manual inspection. ;■'!-. . 

We believe. that these 97: cases may rep ; f 
resent intronless paralogs (see Web table 1 on 
Science Online at www.sciencemag.org/cgi/ - 
content/full/291/5507/1304/DCl) of known 
genes. Most of these are flanked by direct 
repeat sequences, although the precise nature 
of these repeats remains to be determined. All 
of the cases for which we have high confi- •;. 
dence contain polyadenylated [poly(A)] tails " 
characteristic of retrotranspbsition. 

Recent publications describing the phe- 
nomenon of functional intronless paralogs . . 
speculate that retrotransposition may serve as 
- a mechanism used to. escape X-chromosomal 
inactivation (84, 86). We do not find a bias 
toward X chromosome origination of these * "\ 
retrotransposed genes; rather, the results 
show a random chromosome distribution of 
both the intron-containing and corresponding 
intronless paralogs.' We also have found sev- 
eral cases of retrotransposition from a single 
. source chromosome to multiple target chro-: • : 
mosomes. Interesting examples include the / 
retrotransposition of a five exon-containing 
ribosomal protein L21 gene on chromosome 
13 onto chromosomes 1, 3, 4, 7, 10, and 14, 
respectively. The size of the source genes can 
also show variability. The largest example is' 
the 31-exon diacylglycerol kinase zeta gene 
on chromosome 11 that has an intronless 
paralog on chromosome 13. Regardless of 
- -route, * retrdtrarisposition with subsequent 
gene changes in coding or noncoding regions 
that lead to different functions or expression 
patterns, represents a key route to providing 
an enhanced functional repertoire in mam- 
mals (87). 

Our preliminary set of retrotransposed in- 
tronless paralogs contains a clear overrepre- 
sentatiori of genes involved in translatibnal 
processes (40% ribosomal proteins and 10% 
translation elongation factors) and nuclear 
regulation (HMG nonhistone proteins, 4%), 
as well as metabolic and regulatory enzymes. 
EST matches specific to a subset of intronless 
paralogs suggest expression of these intron- 
less paralogs. Differences in the upstream 
regulatory sequences between the source 
genes and their intronless paralogs could ac- 
count for differences in tissue-specific gene 
expression. Defining which, if any, of these 
processed genes are functionally expressed 
and translated will require further elucidation 
and experimental validation. 
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5.2 Pseudogenes 

A pseudogene is a nonfunctional copy that is 
very similar to a. normal gene but that has 
been .-altered slightly so.; that it is not ex- 

- Table -11. 'Genome -overview. * 



pressed. We developed a method for the pre- 
liminary analysis of processed pseudogenes 
in the human genome as a starting point in 
_) elucidating, the ongoing evolutionary forces 



Size, of the genome (including gaps) 

Size of the genome (excluding gaps) ' 
Longest contig 
>: Longest scaffold 
• - Percent of A+T ln the genome 

. , Percent of G+C in the genome ' 
; •. Percent of undetermined bases in the genome 
. Most GC-rich 50 kb 
Least GC-rich 50 kb 
Percent of genome classified as repeats 
. Number of annotated genes 
Percent of annotated genes with unknown function 
Number of genes (hypothetical and annotated) . 
;< Percent of hypothetical and annotated genes with unknown function 
Gene with the most exons t 
Average gene size 
Most gene-rich chromosome 
Least gene-rich chromosomes 

Total size of gene deserts (>500 kb with no annotated genes) ■ 
Percent of. base pairs spanned by genes 
Percent of base pairs spanned by exons 
Percent of base pairs spanned by introns 
Percent of base pairs in intergenic DNA 

. Chromosome with highest proportion of DNA in annotated exons 
Chromosome with lowest proportion of DNA in annotated exons 
Longest intergenic region (between annotated + hypothetical genes) 
Rate of SNP variation 

:*ln these ranges, the percentages correspond to the annotated gene set (26. 383 genes) and the hypothetical + 
annotated gene set (39,1 14 genes), respectively. 

Table 12. Rate of recombination. per physical distance (cM/Mb) across the genome.- Genethon markers 
were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated 
in 3-Mb windows for each chromosome. NA, not applicable. 
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Fig. 11 (continued). Relation among gene density (orange). G+C content dows. The percent of G+C nucleotides was calculated in 100-kbp 
(green), EST density (blue), and Alu density (pink) along the lengths of : windows. The number of ESTs and Alu elements is shown per 100-kbp 
each of the chromosomes. Gene density was calculated in 1-Mbp win- window. 



5.1 Retrotransposition in the human a duplication event. The existence of both events in cellular biology. Identification of 
genome intron-containing and intronless forms of conserved intronless paralogs in the mouse 
Retrotransposition of processed mRNA genes encoding functionally similar or or other mammalian genomes should pro- 
transcripts into the genome results in tunc- identical proteins has been previously de- vide the basis for capturing the evolution- 
tional genes, called intronless paralogs, or scribed (84, 8 J). Cataloging these evolu- ary chronology of these transposition 
inactivated genes (pseudogenes). Aparalog tionary events on the genomic landscape is events and provide insights into gene loss 
refers to a gene that appears in more than of value in understanding the functional and accretion in the mammalian radiation, 
one copy in a given organism as a result of consequences of such gene-duplication A set of proteins corresponding to all 901 
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examined. Unfortunately, too few meioi;i 
crossovers have occurred in Centre d'Etude 
du Polymorphism Humain (CEPH) and other 
reference families to provide a resolution any . 
. finer than about 3 Mbp: The next. challenge " 
rwill.be to determine .a - sequence basis , of 
recombination at the chromosomal level. An 
accurate predictor for the rate for variation in \ 
recombination, rates ..between any pair of 
markers would be extremely useful in design- v- 
.ing markers to narrow a region of linkage, y 
such as in positional cloning projects, v: i. 
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43 Correlation between CpG islands , 
and genes 

CpG islands are stretches of unmethylated 
DNA with a higher frequency of CpG 
dinucleotides when compared with the entire 
genome (74). CpG islands are. believed to 
preferentially occur at the transcriptional start 
of genes, and it has been observed that most • 
housekeeping genes have CpG islands at the 
5' end of the . transcript (75, 76). In addition, ; . 
experimental evidence indicates that CpG is- 
land methylation is correlated with gene in- 
activation (77) and has been shown to be 
important during gene imprinting (78) and .■ 
tissue-specific gene expression (79) 

Experimental methods have been used 
that resulted in an estimate of 30,000 to - 
45,000 CpG. islands in the human genome 
(74, 80) and an estimate of 499 CpG islands 
on human chromosome 22 (81). Larsen et 
al. (76) and Gardiner-Garden and Frommer 
(75) used a computational method to iden- '. 
tiry CpG islands and defined them as re-'-" 
gions of DNA of >200 bp that have a G + C . 
content of >50% and a ratio of observed 



versus expected frequency of CG dinucle- 
otide >0.6. 

It is difficult to make a direct compari- 
son of experimental definitions of CpG is- 
lands with -computational /definitions be 
cause, computational methods do not con-! 
sider, the. methylation state of cytosine,and 
experimental methods do not directly select 
regions of high G+C content. However, we; 
can determine tiie correlation of CpG island 

genomic transcripts and the whole genome 
sequence We /have analyzed the publicly 
available annotation of chromosome 22, as 
well as using the entire human genome in 
' our assembly and the computationally an- 
; notated genes. A .variation of the CpG is- ' 
land. .^computation was compared, with; 
... Larsen et al (76 ). The main differences are , 
that - we use a sliding .window of 200. bp, 
consecutive - windows are merged only if • 
. they /overlap, . and we recompute the CpG ' 
value upon merging, thus rejecting any po- 1 
' tential , island if it scores . less than the 
threshold. " 

. To compute various CpG statistics, we 
used two different thresholds of CG dinucle- 
; otide likelihood ratio. Besides using the orig- ,- 
inal threshold of 0.6 (method 1), we used a 
higher , threshold, of CG dinucleotide likeli- ; 
hood ratio of 0.8 (method 2), which results in 
the number of CpG . islands on chromosome 
i 22 close to the number of annotated genes on 
this chromosome., The main results are sum- 
marized in Table 13. CpG islands computed 
with method 1 predicted only .2.6% of the 
CS A sequence as CpG, but 40% of the gene 
starts (start codons) are contained inside a 



CpG island. This is comparable to ratios re- 
ported by others (82). The last two rows of 
the table, show the. observed « and expected 
^.average, distance, respectiyely^bf the closest 
. ; ;CpG island from the first exonHThe observed 
^average ^ closest CpG, islands are smaller than 
the, corresponding expected ^distances, con- 
i/j firming an association between. CpG island 
and the first exon. 

We also looked at the. distribution of CpG 
^ island nucle^tjdes amohg Various sequence \ 
classes such as intergenic regions, introris, 
exons, and, first exons We .computed the 
likelihood score for each sequence class as 
; - the ratio of the observed fraction of CpG 
. island nucleotides in that sequence class 
and, the expected . fraction of CpG island 
. nucleotides; in that sequence class. The re- 
; suit of applying method A on CSA were 
,.: scores of 0.89/fqr intergenic region, 1.2 for " 
\ intron, 5.86 for' exon,<and ; .13.2 for first * 
,exon. The same trend was also* found for 
chromosome 22 arid after the application of 
a higher threshold (method 2) on both data 
, sets.. In .sum/. genome r wide analysis . has 
extended earlier analysis and suggests a 
strong correlation between CpG islands and 
.first coding exons. 
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4.4 Genome-wide repetitive elements 

The proportion of the genome .covered by 
various classes of repetitive DNA is present- 
ed in Table 14. .We. observed about 35% of 
the genome in these repeat classes, very sim- 
ilar to values reported previously (83). Repet- 
itive sequence may be. :underrepresented in 
' the Celera assembly as a result of incomplete 
repeat resolution, as discussed above. About 
8% of the scaffold length is in gaps, and we 
expect that much of this is repetitive se- 
quence. Chromosome 19 has the highest re- 
peat density (57%), as -well as the highest"" 
gene density (Table 10). Of interest, among 
- the different classes of repeat elements, we 
observe a clear association of Alu elements 
and gene density, which was not observed 
between LINEs and gene density. 

5 Genome Evolution 

Summary. The dynamic nature of genome! 
evolution can be captured at several levels. 
These include gene duplications mediated by 
RNA intermediates (retrotransposition) and 
segmental genomic duplications. In this sec- 
tion, we document the genome-wide occur- 
rence of retrotransposition events generating 
functional (intronless paralogs) or inactive 
genes (pseudogenes). Genes involved in 
translational processes and nuclear regulation 
account for nearly 50% of all introniess para- 
logs and processed pseudogenes detected in 
our survey. We have also cataloged the extent 
of segmental genomic duplication and pro- 
vide evidence for 1077 duplicated blocks 
covering 3522 distinct genes. 
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Examination of pericentromeric regions is 
ongoing. 

The remaining -80% of the genome, . the 
t dichromatic component, is . divisible into G-, 
. . R-, and T-bands (67). These cytogenetic bands 
have been presumed to differ in their nucleotide 
composition and gene density, . although we 
have been unable to determine precise band 
boundaries at the molecular level. T-bands are 
, the most G+C- and gene-rich, and G-bands are 
G+Opoor (o^Bernardi has also offered a 
. description of the euchromatin at the molecular : 
level as long stretches of DNA of differing base 
composition, termed isochores (denoted L, HI, 
H2, and H3), which are >300 kbp in length 
(69). Bernardi defined the L (light) isochores as 
G+C-poor (<43%), whereas the H (heavy) • 
isochores fall into three G+C-rich classes rep-" 
resenting 24, , 8, and 5% of the genome. Gene 
• concentration has been claimed to be very low k 
. in the L isochores and 20-fold more enriched in 
the H2 and H3 isochores (70). By examining 
contiguous 50-kbp windows of G+C content J. 
, across the assembly, we found that regions of 
G+C content >48% (H3 isochores) averaged 
273.9 kbp in length, those with G+C content 
between 43 and 48% (HI +H2 isochores) aver- 
aged 202.8 kbp in length, and the average span 
of regions with <43% (L isochores) was. 
1078.6 kbp. The correlation between G+C 
^content and gene density was also examined in . 
p-kbp windows along the assembled sequence / - 
Table 9 and Figs. 10 and 11). We found that 
the density of genes was greater in regions of 
high G+C than in regions of low G+C content, 
as expected However, the correlation between 
G+C content and gene density was not as 
skewed as previously predicted (69). A higher 
proportion of genes were located in the G+C- 
poor regions than had been expected. 

Chromosomes 17, 19, and 22, which have 
a disproportionate number of H3-containing 
bands, had the highest gene density (Table 
10). Conversely, of the chromosomes that we 



ft 
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found to have the lowest gene density, X, 4, 
.18, 13, and Y, also have the fewest H3 bands. 

'\ i Chromosome 15, which also jhas ,few,H3 

...bands, did not. have a particularly low. gene 
density; in our analysis! In addition, chromo- 
; some 8, which we found to have a low gene 
.density, does not appear, to be unusual in its 

■'" H3 banding. • 
How iyaUd ,is .Ohno's postulate (71) that 

"y. -mainmalian genomes consist, of pases of genes 
in otherwise essentially empty deserts? It ap- 

;: pears that the.human genome does indeed con- 
tain deserts, or large, gene-poor regions. If we 

• define a desert as a region >500 kbp without a 

. gene, then we see that 605 Mbp, or about 20% 
of the genome, is in deserts. These are not 
.uniformly distributed over, the various chromo- 
somes. Gene-rich chromosomes 17, 19, and 22 
have only : about 12% of their , collective 171 : 
Mbp in deserts, whereas gene-poor chromo- 
somes 4, 13,. 18, and X have 27.5% of their 492 
Mbp in deserts (Table 1 1). The apparent lack of 
predicted genes in these regions does not nec- 
essarily imply that they are devoid of biological 
function. 

4.2 Linkage map 

Linkage maps provide the basis for genetic 
analysis and are widely used in the study of the , 
inheritance of traits and in the positional clon- 
: ing of genes. The distance metric, centimorgans 
(cM), is based ; on the recombination rate be- 
tween homologous chromosomes during rheio- ., ; 

Table 9. Characteristics of C+C in isochores. 



,. sis.. In . general, .the rate of recombination in 

• .. .females is'greater;man .that' in males, and this 

degree, of map expansion is not uniform across 
: the genome (72). One of the opportunities en- 

- •>_ abled by a nearly complete genome sequence is 
_.;to;produce the ultimate physical map, and to 
./^fully analyze its correspondence with two other 

t r ; maps that have been ^; widely used ; m : genome 

^ : --and:genen;p .^a^ 

■^cytogenetic : map/vjrus v ^u^ loop 
. ^between the mapping and sequencing phases of 
j ;the genome project. 

; S We mapped - the location of the markers 
; that constitute the Genethon linkage map to 
.the genome. -The rate, of recombination, ex- 
. pressed as cM per ■ Mbp, .was calculated for \ 

• :3-Mbp windows as shown; in Table 12. High- 
er. jates , of recombination . in the telomeric 

- ;region ; of the chromosomes have been previ- 
ously documented (73). From this mapping 
result, there is a difference of 4.99 between 

r lowest rates and highest rates and the largest 
; difference pf,4,4 between, males and females " 
(4.99 to 0.47.on chromosome 16). This indi- 
cates that the variability in recombination 
rates among regions of the genome exceeds 

• the differences in . recombination rates be- 
tween males: and females.. The human ge- 
nome has recombination hotspots, where re- 
combination rates vary fivefold or more over 
a space of 1 kbp, so the picture bne gets of the 
magnitude of - : variability in recombination 

. rate will depend on . the size of the window 



Fig. 9. Comparison of 
the number of exons 
per transcript between 
the 17,968 Otto tran- 
scripts and 21350 de 
novo transcript predic- 
tions with at least one 
line of evidence that 
do not overlap with an 
Otto prediction. Both 
sets have the highest 
number of transcripts 
in the two-exon cate- 
gory, but the de novo 
gene predictions are 
;ewed much more 
rard smaller tran- 
ipts. In the Otto set, 
9.7% of the tran- 
scripts have one or 
two exons, and 5.7% 
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have more than 20. In the de novo set, 49.3% of the transcripts have one or two exons, and 0.2% have more than 20. 
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confidence based on the supporting evidence. 
Transcripts encoded by a set of 26,383 genes 
were assembled for further analysis. This set 
includes the 6538 genes predicted by Otto on . 
: » the basis of matches to known genes, 1 1,226 . 
i . transcripts predicted by Otto based on homol- : ; 
v ogy evidence, and 8619 from the subset of 
transcripts from de novo gene-prediction pro- 
grams that have two types of supporting ev- 
7: idence.' The 26,383 genes are illustrated albng 
chromosome diagrams m Fig 1 These are a 
very preliminary set of annotations and are 
^ subject to all the limitations of an automated . 

process. Considerable refinement is still nec- . 
: essary to improve the accuracy of these tan-.', 
.script predictions. All .the predictions and 
descriptions of genes and the associated evi- ; 
. dence that we present are the product of 
completely computational processes, not ex- 
: pert curation. We have attempted to enumer- 
ate the genes in the human genome in such a 
. way that we have different levels of confi- 
dence based on the amount of supporting 
evidence: known genes, genes with good pro- 
tein or EST homology evidence, and de novo 
gene predictions confirmed by modest ho- . 
mology evidence. 

3.4 Features of human gene 
transcripts 

We estimate the average span for a "typi- 
cal" gene in the human DNA sequence to 
be about 27,894 bases. This is based on the 
average span covered by RefSeq tran- 

: scripts, used because it represents our high- . . 
est confidence set. 

The set of transcripts promoted to gene 
annotations varies in a number of ways. As 
can be seen from Table 8 and Fig. 9, tran- 
scripts predicted by Otto tend to be longer, 
having on average about 7.8 exons, whereas 

: those-- ppomoted-from -gene-prediction -pro- 
grams average , about 3.7 exons; The largest 
number of exons that we have identified in a 
transcript is 234 in the titin mRNA. Table 8 
compares the amounts of evidence that sup- 
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port , the Otto and other predicted transcripts. 

For example, one can see that a typical Otto 

.transcript has 6.99 of its 7.81 exons supported 
,;by protein homology evidence. As would be 
; expected, the Otto transcripts generally have 
. more support than do transcripts predicted by 

the de novo methods. 

A Genome Structure > 

Summary! This section describes several of 
the noncoding attributes of the assembled 
genome sequence and their correlations with 
the predicted gene set. These include an anal- 
ysis of G+C, content and gene density. in the 
: context pf cytogenetic maps of the genome, 
an eriumerative analysis of CpG islands, and 
a brief description of the genome-wide repet- 
itive elements. , 



4.1 Cytogenetic maps 

Perhaps the most obvious, and certainly the 
most visible; element .of; the structure of 
»?, ;the genome is the banding pattern produced 
•; by Giemsa j stain. 'Chromosomal banding 
v studies have revealed that about 17% to 
.20% of the human chromosome comple- 
. .ment. consists , of C-bands, or constitutive 
,• heterochromatin (^).iMuch iof this hetero- 
chromatin is -highly polymorphic and con- 
sists of different families of alpha satellite 
DNAs ; with vanous higher order repeat 1 
'. : ; structures (65). ' Many ..chromosomes have 
• complex inter- and intrachromosomal du- 
: .plications present in pericentromeric *re- 
.'■ gions (66). About 5% of the sequence reads 
../.were identified as alpha satellite sequences; 
. these were not included in the assembly. 
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Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512 
Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text 
for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely 
on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by 
supplying all available evidence to Genscan) were tallied. These data .show k the degree to which 
multiple 'Genscan predictions and/or ; Otto annotations -Were associated "With 'a single RefSeq 
transcript. The zero class for the Otto-homology predictions shown here indicates that the 
Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call 
was made because of insufficient evidence. 



Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate 
the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions). - 







Total 




- Types of evidence 






No. of lines of evidence* 




Mouse 


Rodent 


Protein 


Human 


2:1 


2:2 


2:3 


2:4 


Otto 


Number of 


17,969 


17,065 


14.881 


15.477 


16,374 


17,968f 


17,501 


15,877 


12,451 




transcripts 
















Number of 


141,218 


111,174 


89,569 


108,431 


118,869 


140,710 


127,955 


99.574 


59,804 


De novo 


exons 












Number of 


58,032 


14,463 


5.094 


8,043 


9,220 


27,350 


8,619 


4,947 


1.904 




transcripts . 














Number of 


319,935 


48,594 


19.344 


26,264 


■ 40,104 


79,148 


31,130 


17,508 


6,520 


No. of exons per 


exons 
















Otto 


7.84 


5.77 


6.01 


6.99 


7.24 


7.81 


7.19 


6.00 


4.28 


transcript 


De novo 


5.53 


3.17 


3.80 


3.27 


4.36 


3.7 


3.56 


3.42 


3.16 



Four kinds of evidence (conservation in 3X mouse genomic DNA. similarity to human EST or cDNA. similarity to rodent EST or cDNA. and similarity to known proteins) were 
considered to support gene predictions from the different methods. The use of evidence is quite liberal requiring only a partial match to a single exon of predicted transcript. tThis 
number includes alternative splice forms of the 1 7.764 genes mentioned elsewhere In the text 
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bases flanking these regions). The other bases 
in the region, those not covered by any homol- : 
ogy evidence, were replaced by N's. This se- 
quence segment, with high confidence regions 

■represented by. the consensus genomic se- 
quence and the remainder represented by N's, 
was then evaluated by Genscan to see if a 
consistent gene model could be generated This , 
procedure simplified, the gene-prediction task 
by first establishing the boundary for the gene 
(not a strength of most gene-finding algo- 

. rithms), and by eliminating regions with no 
supporting evidence. If Genscan returned , a 
plausible gene model, it was further evaluated 
before being promoted to an "Otto" annotation. •> 
The final Genscan predictions were often quite 

. different from the prediction that Genscan re- :- 
turned on the same region of native genomic 
sequence. A weakness of using Genscan to 
refine the gene model is the loss of valid, small 
exons from the final annotation. 

The next step in defining gene structures 
based on sequence similarity was to compare, 
each predicted transcript with the homology- 
based evidence that was used in previous steps 
to evaluate the depth of evidence for each exon 
in the prediction. Internal exons were consid- 
ered to be supported if they were covered by 
homology evidence to within ±10 bases of 
their edges. For first and last exons, the internal 
edge was required to be within 1 0 bases, but the 
external edge was allowed greater latitude to 
allow for 5'. and 3'/ untranslated regions 1 
(UTRs). To be retained, a prediction for a 
multi-exon gene must have evidence such that 
the total number of 4< hits," as defined above, 
divided by the number of exons in the predic- 
tion must be >0.66 or must correspond to a 
RefSeq sequence. A single-exon gene must ^ e 
covered by at least three supporting hits (±10 
bases on each side), and these must cover the 
complete predicted open reading frame. For 
a single-exon gene, we also required that 
the Genscan prediction include both a start 
and a stop codon. Gene models that did not 
meet these criteria were disregarded, and 



Table 7. Sensitivity and specificity of Otto and 
Genscan. Sensitivity and specificity were calculat- 
ed by first aligning the prediction to the published 
RefSeq transcript, tallying the number (A/) of 
uniquely aligned RefSeq bases. Sensitivity is the 
ratio of N to the length of the published RefSeq 
transcript. Specificity is the ratio of N to the 
length of the prediction. All differences are signif- 
icant (Tukey HSD; P < 0.001). 



Method 


Sensitivity 


Specificity 


Otto (RefSeq only)* 


0.939 


0.973 


Otto (homology)t 


0.604 


0.884 


Genscan 


0.501 


0.633 



•Refers to those annotations produced by Otto using only 
the Sim4-polished RefSeq alignment rather than an evi- 
dence-based Genscan prediction. t Re f ers to those 
annotations produced by supplying all available evidence 
to Genscan. 
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those that passed were promoted to Otto 
^predictions. Homology-based .Otto .predic- 

.tions do not contain 3' and 5' untranslated,- ^ 
. sequence. Although three de novo gene-finding . 

programs [GRAIL, Genscan, and •: FgenesH < > 

(63)] were run as part of the ^computational 

analysis, the results of these programs were not ... 

directly ; used in making the Otto: predictions. \M>.i 
„ Otto predicted 11,226 . additional genes by , 

means of sequence similarity. . ?• 

3.2 Otto validation 

To validate ; .the Otto homology-based process 
..and the method that Otto uses to define the 7 
; structures of known genes, we compared tran- 
scripts predicted by Otto with their correspond- ■-, 
«ing (and presumably correct) transcript from a 
set of 4512 RefSeq transcripts for which there 
. was a unique SLM4 alignment (Table 7). In 
•> order to evaluate the relative performance of > 
, Otto and Genscan, we made three comparisons. ■■■■ 
The first involved a detenriination of the accu- 
racy of gene models predicted by Otto with . w 
; .only.homology data other than the correspond- 7 
. ing RefSeq sequence (Otto homology in Table 
7). We measured the sensitivity (correctly pre- 
. dieted bases divided by the total length of the 
xDNA) and specificity (correctly predicted 
, bases divided by the sum of the correctly and 
: incorrectly predicted bases). Second, we exam- 
; ined the sensitivity and specificity of the Otto 
predictions that were made solely with the Ref- r \ 
: . Seq sequence, which is the process .that Otto .r V 
uses to annotate known genes (Otto-RefSeq). 
And third, we determined the accuracy of the 
Genscan predictions corresponding to these 
RefSeq sequences. As expected, the alignment 
method (Otto-RefSeq) was the most accurate, 
. and Otto-homology performed better than Gen- 
scan by both criteria. Thus* 6,1 % of true RefSeq • 
. nucleotides were not represented in the Otto- : V? 
* refseq annotations and 2.7% of the nucleotides % 
in the Otto-RefSeq transcripts were not con- 
tained in the original RefSeq transcripts. The 
discrepancies could come from legitimate 
differences between the Celera assembly 
and the RefSeq transcript due to polymor- 
phisms, incomplete or incorrect data in the 
Celera assembly, errors introduced by Sim4 
during the alignment process, or the pres- 
ence of alternatively spliced forms in the 
data set used for the comparisons. 

Because Otto uses an evidence-based ap- 
proach to reconstruct genes, the absence of 
experimental evidence for intervening exons 
may inadvertantly result in a set of exons that . 
cannot be spliced together to give rise to a 
transcript In such cases, Otto may "split genes" 
when in fact all the evidence should be com- 
bined into a single transcript. We also examined 
the tendency of these methods to incorrectly 
split gene predictions. These trends are shown 
in Fig. 8. Both RefSeq and homology-based 
predictions by Otto split known genes into few- 
er segments than Genscan alone. 



3.3 Gene number 

Recognizing 4hat,the, Otto system is quite 
conservative, we used a different gene-pre- 
diction strategy in -regions where the ho- 
mology evidence was less strong. Here the 
results of de novo gene predictions were 
used. For these genes, we insisted that a 
predicted transcript have at least two of the 
following types of evidence. to be. included 
in the gene set for further analysis: protein, 
human EST, rodent EST, or mouse genome 
fragment matches. This final class' of pre- 
dicted genes is a subset of the predictions 
made by the three gene-finding programs 
that were used in the computational pipe- 
line. For these, there. was not sufficient 
sequence similarity information for Otto to 
attempt 7 to predict a gene structure. The 
three de novo gene-finding programs re- 
sulted in about 71 55,695 predictions, of 
which ~76,4 10 were nonredundant (non- 
overlapping with one another). Of these, 
57,935 did noUoverlap known- genes or 
predictions made by Otto.'; Only 2 1 ,350 of 
the gene predictions that did not overlap 
Otto predictions were partially supported 
by at least one type of sequence similarity 
evidence, and 8619 were partially support- 
ed by two types of evidence (Table 8). 

- The sum of this number (21,350) and the 
number of Otto annotations (17,764), 39,1 14, 
is near the upper limit for the human gene 
complement. ■ As seen in Table. 8, if the re- 
quirement for other : supporting evidence is 
made more stringent, this number drops rap- 
idly so that demanding two types of evidence 
reduces the total gene number to 26,383 and 
demanding three types reduces it to ~23,000. 
Requiring that a prediction be supported by 
all four categories of evidence is too stringent 
because it would eliminate genes that encode 
novel proteins (members ; of currently unde- 
scribed protein families). No correction for 
pseudogenes has been made at this point in 
the analysis. 

In a further attempt to identify genes that 
were not found by the autoannotation process 
or any of the de novo gene finders, we ex- 
amined regions outside of gene predictions 
that were similar to the EST sequence, and 
where the EST matched the genomic se- 
quence across a splice junction. After correct- 
ing for potential 3' UTRs of predicted genes, 
about 2500 such regions remained. Addition 
of a requirement for at least one of the fol- 
lowing evidence types— homology to mouse 
genomic sequence fragments, rodent ESTs, 
or cDNAs — or similarity to a known protein 
reduced this number to 1010. Adding this to 
the numbers from the previous paragraph 
would give us estimates of about 40,000, 
27,000, and 24,000 potential genes in the 
human genome, depending on the stringency 
of evidence considered. Table 8 illustrates the 
number of genes and presents the degree of 
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evaluates evidence generated by the compu- 
tational pipeline, corresponding to conserva- . 
tion .between mouse, and human genomic. 
DNA, similarity to human transcripts (ESTs 
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Fig. 7. Schematic view of the distribution of breakpoints and large gaps 
on all chromosomes. For each chromosome, the upper pair of tines 
represent the PFP assembly, and the lower pair of lines represent Celera's 



assembly. Blue tick marks represent breakpoints, whereas red tick marks 
represent a gap of larger than 10,000 bp. The number of breakpoints per 
chromosome is indicated in black, and the chromosome numbers In red. 
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gene boundaries. During this process, multiple , 
hits to the same region were collapsed to a 
coherent set of data by tracking the coverage of . .. 
a region. For example, if a group of bases was v 
represented by multiple overlapping ESTs, the ;. 
union of these regions matched by the set of \ 
ESTs on the scaffold was marked as being ... 
supported by EST evidence. This resulted in a . 
series of "gene bins," each of which was be-, 
lieved to contain a single gene. One weakness of . .; 
this initial implementation of the algorithm was \ 
in predicting gene boundaries in regions of tan- : 
demly. duplicated genes. Gene clusters frequent- 
ly resulted in homologous neighboring genes 
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being joined together, resulting in an annotation 
that artificially concatenated these gene models. 
. Next, known genes (those with exact match- 
es of a full-length cDNA sequence to the ge- 
nome) were identified, and the region corre- 
sponding to the cDNA was annotated as a 
predicted transcript. A subset of the ^ciliat- 
ed, human gene set RefSeq from the Nation- 
al , Center for . Biotechnology Information . 
(NCBI) was included as a data set searched in 
the computational pipeline. If a RefSeq tran- 
script matched the genome assembly for at least 
50% of its length at >92% identity, then the 
SIM4 (63) alignment of the RefSeq transcript to 



> .the region of the genome under analysis was 
promoted to the status of an Otto annotation. 

: Because the genome sequence has gaps and 
sequence errors such as frameshifts, it was not 
always possible to predict a transcript that 
agrees precisely with the experimentally deter- 

: mined cDNA sequence. A total of 6538 genes 
in our inventory were identified and transcripts 
predicted in this way. 

.':«■'■ Regions that have a substantial amount of 
sequence similarity, but do not match known 
genes, were analyzed by that part of the Otto 
system that uses the sequence; similarity in- 
formation to predict a transcript. Here, Otto 
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Fig. 6. Comparison of the CSA and the PFP assembly. 
(A) All of chromosome 21, (B) all of chromosome 8 f 
and (C) a 1-Mb region of chromosome 8 representing 
a single Celera scaffold. To generate the figure, Celera 
fragment sequences were mapped onto each assem- 
bly. The PFP assembly is indicated in the upper third 
of each panel; the Celera assembly is indicated in the 
lower third. In the center of the panel, green lines 
show Celera sequences that are in the same order and 
orientation in both assemblies and form the longest 
consistently ordered run of sequences. Yellow lines 
indicate sequence blocks that are in the same orien- 
tation, but out of order. Red lines indicate sequence 
blocks that are not in the same orientation. For 
clarity, in the latter two cases, lines are only drawn 
between segments of matching sequence that are at 
least 50 kbp long. The top and bottom thirds of each 
panel show the extent of Celera mate-pair violations 
(red, misoriented; yellow, incorrect distance between 
the mates) for each assembly grouped by library size. 
(Mate pairs that are within the correct distance, as 
expected from the mean library insert size, are omit- 
ted from the figure for clarity.) Predicted breakpoints, 
corresponding to stacks of violated mate pairs of the 
same type, are shown as blue ticks on each assembly 
axis. Runs of more than 10,000 Ns are shown as cyan 
bars. Plots of all 24 chromosomes can be seen in Web 
fig. 3 on Science Online at www.sciencemag.org/cgi/ 
content/full/291/5507/1304/DC1. 
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Lori breakpoints for the PFP assembly than 
Ke^S assembly. Figure 7 shows ■ 
f breakpoint map (blue tick marks) fa bo* 
* cLmblies of each chromosome m a side-by 
' detsWon The order and orientation of 

Sks to the CSA.assembly, the size of all 
STbave been estimated on *e *as.s of the 
8 ^ n Wdata Breakpoints can be caused by 

Shed nature of tAh genome.asse^lKS. ^ 
3 Cere Prediction and Annotation : 
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^A C ^;S^.o 
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fidence that were u sectlons 
initial computational approach. 



3.1 Automated gene annotation 

A eene is a iocus of cotranscribed exons. A 
!T«m may give rise to multiple tran- 

3 multiple functions, by means of altema- 
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sembly. against other finished sequence for 
determining sequencing accuracy at the nu- 
cleotide level, although this has been done for 
identifying polymorphisms as described in 
Section 6. The accuracy of the consensus 
sequence is at least 99.96% on the basis of a. 
statistical estimate derived from the quality 
values of the underlying reads. . . 

f The structural consistency of the assembly ; 
can be measured by mate-pair analysis. In a ■ 
, correct assembly, every ; mated pair of se- : 
quencing reads should be located on the con-, 
sensus sequence .with the correct separation, 
; and orientation between the pairs. A pair ;is . 
termed "valid" when the reads are in the 
correct orientation' and the distance between 
i.them is within the mean. ± 3 standard devi- : 
ations of the distribution of insert sizes of the 
library from which the pair was sampled. A 
pair is termed "misoriented" when the reads 
are not correctly oriented, and is termed "mis- 
separated" when the. distance between the 
reads is not in the correct range but the reads . 
are correctly oriented. The mean ± the stan- 
dard deviation of each library used by the 
assembler ., was determined - as -described 
above. To validate . these, we examined, all 
reads mapped to the finished sequence of 
chromosome 21 (48) and determined how 
many incorrect mate pairs there were as a 
result of laboratory tracking errors and chi- 
; merism (two different segments of the ge- , 
nbme cloned into, the same plasmid), and how 
tight the distribution of insert sizes was for 



those that were correct (Table 5). The;stan-i 
/ dard deviations for all Celera libraries were 
: quite small, less than. 15%. of the. insert J 

length, with the exception of a few 50-kbp 
. libraries. The 2- and 10-kbp libraries con- ; 
i, tained less than 2% invalid mate pairs, where-, 

■ as the 50-kbp libraries were somewhat higher 
; (-10%). Thus, although the mate-pair infor-, 
, mation was not perfect, its accuracy was such ; 
. v that ^measxiring valid, misoriented, and mis-.= 

separated pairs with respect to a given assem- -. 
• ■ bly was deemed, to. be. a reliable instrument 

• for validation purposes, especially when sev- . 

eral mate pairs confirm or deny, an ordering. 
: - r,. . The clone .coverage of the genome was 
.. ^^X, . meaning that any given base. pair was, 
, . on average, contained in 39 clones or, equiv- : 
v alently, spanned; by 39 mate-paired reads. 
. Areas of low clone coverage or areas with a 
■> high proportion of invalid mate pairs would . 

indicate potential assembly problems. ; We 

.computed the coverage of each base in the 
: assembly by valid mate pairs (Table 6). In 

■ summary, for scaffolds >30 kbp in length, 
. less than 1% of the Celera assembly was in 
... regions of less than 3 X clone coverage. Thus, 
<; more than 99% ; of the assembly, including 
: order and orientation, is. strongly supported 

by this measure alone. 

. . .We examined the locations and number of 
; all ; misoriented and 1 , misseparated -.mates. .In 
: * addition to doing • this analysis .on the CS A * 

assembly, (as :of 1 ^October 2000), we also ■ 

■performed' a study of the PFP assembly as of 



_£t5 /.September 2000 ^30, '; 55b), In this latter 

- case, Celera mate pairs had to be mapped to 
;the PFP assembly. To avoid mapping errors 
. due to; high-fidelity repeats, the only pairs 

mapped were those for which both reads 
... * matched at only one location with less than 
. 6% differences. A threshold was set such that 
.sets, of five or more simultaneously invalid • 

- -.mate pairs indicated a -potential breakpoint, : 
where the construction of the two assemblies 

■\: differed. The graphic comparison of the CSA 
»; chromosome 21 assembly with the published 
. sequence (Fig. 6A) serves as a validation of 
. this methodology. Blue tick marks in the 
panels indicate breakpoints. .There were a 
^similar (small) number • of - breakpoints on 
-.both chromosome- sequences. .The lexception 
-was 12. sets of scaffolds in the Celera assem- 
bly (a total of 3% of the chromosome length 
« in .212 single-contig scaffolds) -that were 
. mapped to the wrong positions because they 
were too small to be mapped reliably. Figures 
: ,6 and 7 and Table 6 illustrate , the mate-pair 
differences and breakpoints between the two 
assemblies.There was a higher percentage of 
c misoriented and misseparated mate pairs in 
. the large-insert libraries (50 kbp and BAC 
ends) than in the small-insert libraries in both 
assemblies (Table 6). The large-insert librar- 
ies are more likely to identify discrepancies * 
; /simply because they span a larger segment of : 
(the genome^ The:- graphic ^comparison v: be- ^ 
r tween the two assemblies for chromosome 8 
(Fig. 6, B and C) shows that there are many 



Table 5. Mate-pair validation. Celera fragment sequences were mapped to 
the published sequence of chromosome 21. Each mate pair uniquely 
mapped was evaluated for correct orientation and placement (number 



of mate pairs tested). If the two mates had incorrect relative orienta- 
tion or placement, they were considered invalid (number of invalid mate 
pairs). 



Library 
type 



Chromosome 21 



Genome 



Library 
no. 


Mean 
insert 
size 
(bp) 


SD 
(bp) 


SD/ 
mean 

(*) 


No. of 
mate 
pairs 
tested 


1 


2.081 


106 


5.1 


3,642 


2 


1,913 


152 


7.9 


28,029 


3 


2.166 


175 


8.1 


4,405 


4 


11,385 


851 


7.5 


4,319 


5 


14.523 


1.875 


12.9 


7,355 


6 


9.635 


1,035 


10.7 


5,573 


7 


10,223 


928 


9.1 


34,079 


8 


64.888 


2,747 


4.2 


16 


9 


53.410 


5,834 


10.9 


914 


10 


52.034 


7.312 


14.1 


. 5,871 


11 . 


52,282 


7.454 , 


14.3 


2,629 


12 


46.616 


7,378 


15.8 


2,153 


13 


55,788 


10,099 


18.1 


2.244 


14 


39,894 


5,019 


12.6 


199 


15 


48,931 


9,813 


20.1 


144 


16 


48.130 


4,232 


8.8 


195 


17 


106,027 


27,778 


26:2 


330 


18 


160,575 


54,973 


34.2 


155 


19 


164,155 


19,453 


11.9 


642 



No. of 
invalid 
mate 
pairs 


% 
invalid 


Mean 
insert 
size (bp) 


SD 
(bp) 


38 


1.0 


2,082 


90 


413 


1.5 


1,923 


118 


57 


1.3 


2,162 


158 


80 


1.9 


11,370 . 


1 696 


156 


2.1 


14,142 


1,402 


109 


2.0 


9.606 


934 


399 


1.2 


10,190 


777 


1 


6.3 


65,500 


5,504 


170 


18.6 


53,311 


5,546 


569 


9.7 


51,498 


6,588 


213 


8.1 


52,282 


7,454 


215 


10.0 


45,418 


9,068 


249 


11.1 


53,062 


10,893 


7 


3.5 


36.838 


9,988 


10 


6.9 


47,845 


4,774 


14 


7.2 


47,924 


4,581 


16 


4.8 


152,000 


26,600 


8 


5.2 


161,750 


27,000 


44 


69 


176,500 


19,500 


2.768 


2.7 






(mean ■ 2.7) 









SD/ 
mean 
(%) 



2 kbp 
10 kbp 

50 kbp 



BES 



Sum 



4.3 
6.1 
7.3 

6.1 
9.9 
9.7 
7.6 
8.4 
10.4 
12.8 
14.3 
20.0 
20.5 
27.1 
10.0 
9.6 
17.5 
16.7 
11.05 



102,894 
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with GM99: These scaffolds were termed 
"ordered scaffolds." We found that 13.9% of 
the assembly could be ordered by these ad- 
ditional methods, and thus 84.0% of the ge- 
nome was ordered unambiguously. 
, Next, all scaffolds that could be placed, 
but not ordered, between anchors were as- 
signed to the interval between the anchored 
scaffolds and were deemed to be "bound- 
ed" between them. For example, small scaf- 
folds having STS hits from the same Gene- 
Map bin or hitting the same BAC cannot be • 
.^ordered relative to each other, but can be 
..-.assigned a placement boundary relative to 
other anchored , or ordered scaffolds. The 
• remaining scaffolds either had no localiza- , 
tion information/ conflicting information, 
or could only be. assigned to a generic 
chromosome location. Using the above ap- . 
proaches, ^-98% of the genome was an- 

■ ; chored, ordered, or bounded; . - 

Finally, we assigned a location for each , 
scaffold placed on the chromosome by >■ 
spreading out the scaffolds per chromosome. , 
We assumed that the remaining unmapped 
scaffolds, '.constituting 2% of the. genome, 
were distributed evenly across the genome/. . 
. By dividing the sum of unmapped scaffold 
lengths with the sum of the number of . . 
mapped scaffolds, we arrived at an estimate • 
. of interscaffold gap of 1483 bp. This gap was 
used to separate all the scaffolds on each - 
chromosome and to. assign an offset in the 
chromosome. . 

During the . scaffold-mapping effort, we en- 
. countered many problems that resulted in addi- 
tional quality assessment and validation analy- 
sis. At least 978 (3% of 33,173) BACs were 
believed to have sequence data from more than 
one location in the genome (47). This is con- 
sistent with the bactig chimerism analysis re- 
ported above in the Assembly Strategies sec- 
tion. These BAGs could not- be - assigned to 
unique positions within the CSA assembly and 
thus could not be used for ordering scaffolds. 
Likewise, it was not always possible to assign 
STSs to unique locations in the assembly be- 
cause of genome duplications, repetitive ele- 
ments, and pseudogenes. 

Because of the time required for an ex- 
haustive search for a perfect overlap, CSA 
generated 21,607 intrascaffold gaps where 
the mate-pair data suggested that the contigs 
should overlap, but no overlap was found. 
These gaps were defined as a fixed 50 bp in 
length and make up 18.6% of the total 
116,442 gaps in the CSA assembly. 

We chose not to use the order of exons 
implied in cDNA or EST data as a way of 
ordering scaffolds. The rationale for not us- 
ing this data was that doing so would have 
biased certain regions of the assembly by 
rearranging scaffolds to fit the transcript data 
and made validation of both the assembly and 
gene definition processes more difficult. 



THE HUMAN GENOME 
7 Assembly and validation analysis 

We analyzed the assembly of the genome 
from the perspectives of completeness 
(amount of. coverage of the genome) and 
correctness (the structural accuracy , of . the 
order and orientation and the consensus : se- 
• quence of .the assembly). 

Completeness. Completeness is defined as 
the percentage of the euchromatic sequence 
; represented .in the assembly.- This cannot be 
known with absolute certainty until the eu- 
chromatin sequence has been completed. 
However, it is. possible to estimate complete- ; 
ness on the basis of (i) the estimated sizes of . 

■ intrascaffold gaps; (ii) coverage of the two 
, published chromosomes, 2 1 and 22 (48, 49); 
, and (iii) ..analysis .of, the percentage -of an ; 
, independent set of random sequences (STS 

markers) contained in the assembly. The ; 
, whole-genome libraries contain : heterochro- " 
matic sequence and, although no attempt has 
., been made- to assemble it, there may be in- 

■ stances ofunique sequence embedded in .re- . 
. gions of heterochromatin as were observed in 

Drosophila (50, 51). 

. : , The sequences of human chromosomes 2liz 
: and 22 have been completed to high quality 
and published (48, 49). Although this se- 
. quence. served as input, to the .assembler, the - 
•finished sequence was shredded into a shot-'; 
gun data > set so that the "assembler had the . 
opportunity to assemble it differently from . , 
•/ the original sequence in the case of structural .. 
; polymorphisms or , ? assembly errors in the •■ 
; BAC data. In particular,, the assembler must v 
be able to resolve repetitive elements at the 4 
scale ;of : components v (generally ; multimega- . 
base in size), and so this comparison reveals 
the level to which the assembler resolves 
repeats. In certain areas, the assembly struc- 
ture differs from the published versions of 
chromosomes 21 and 22 (see below). The . 
consequence of the flexibility- to assemble 
"finished" sequence, differently on the basis . < 
of Celera data resulted in an. assembly with 
more segments than the chromosome 21 and : 
22 sequences. We examined the reasons why 
there are more gaps in the Celera sequence 
than in chromosomes 21 and 22 and expect 
that they may be typical of gaps in other 
regions of the genome. In the Celera assem- 
bly, there are 25 scaffolds, each containing at 
least 10 kb of sequence, that collectively span 
94.3% of chromosome 21. Sixty-two scaf- 
folds span 95.7% of chromosome 22. The 
total length of the gaps remaining in the 
Celera assembly for these two chromosomes 
is 3.4 Mbp. These gap sequences were ana- 
lyzed by RepeatMasker and by searching 
against the entire genome assembly (52). 
About 50% of the gap sequence consisted of 
common repetitive elements identified by Re- 
peatMasker; more than half of the remainder 
was lower copy number repeat elements. 
A more global way of assessing complete- 



ness '„•'&> measure the content of an independent 
set of sequence data in the assembly. We com- 
pared 48,938 STS markers from Genemap99 
(5 J) to the. scaffolds. Because -these markers, 
: :; ..were not used in the assembly -processes, they 
.-provided a truly independent measure of com- 
c pleteness/ ePCR (53) .and BLAST <(54) were 
'. used to locate STSs.on;the assembled genome. 
We found 44,524 (91%) of the STSs in the 
.. mapped, genome. An additional .2648 markers 
V' (5.4%) .were found by searching .the unas- 
sembled data or "chaff ,? We identified 1283 
w- STS- markers (2.6%) not found in either. Celera - 

- sequence or BAC data as of September .2000, 
A; raising the possibility that these, markers may 

not be of human origin. If that were .the case, ; 
; the Celera assembled sequence would represent '.; 
. : 93.4% of the human genome and the unas- ; 
: • sembled data 5.5%, for a total of 98.9% cover- ; 

age. Similarly, - we compared^ CSA \. against ^ 
' 36,678 .TNG radiation hybrid ; markers (55a) 

- using the same method: We found that 32,371 
markers. (88%) .'were -located, in. the .mapped - 
CSA scaffolds, .with 2055- markers (5.6%) . 
found in the remainder. This gave a 94% cov- 

-erage of the genome . through another genome- 
wide survey. ■■• ■ ■ ' V = 5 v V, 

Correctness. Correctness is defined as the 
structural and sequence r.accuracy of the as- 
i sembly. Because me source sequences for the 
* Celera data arid the "GenBank data are from / 
. different individuals, we could not : directly 
compare the consensus sequence of the as-: ; : : 



•Table 4. Summary of scaffold :mapping.-5caf folds-' 
<were mapped to the genome with different levels 
of confidence (anchored scaffolds have the highest 
confidence; unmapped scaffolds have the lowest). 
Anchored scaffolds were consistently ordered by 
the WashU BAC map and GM99. Ordered scaf- 
folds were consistently ordered by at least one of 
the following: the WashU BAC map, GM99, or 
component tiling path. Bounded scaffolds had or- 
der conflicts between at least two of the external 
maps, but their. placements -were. 'adjacent to a 
neighboring anchored .or ordered scaffold. Un- 
mapped scaffolds had, at most, a chromosome 
assignment. The scaffold subcategories are given 
below each category. 



Mapped 






% 


scaffold 


Number 


Length (bp) 


Total 


category * 






length 


Anchored 


1,526 


1,860,676,676 


? 70 


Oriented 


1,246 


1,852,088.645 


70 


Unoriented 


280 


8,588,031 


0.3 


Ordered 


2.001 


369,235,857 


14 


Oriented 


839 


329,633,166 


12 


Unoriented 


1,162 


♦ 39,602,691 


2 


Bounded 


38.241 


368,753.463 


14 


Oriented 


7.453 


274.536,424 


10 


Unoriented 


30,788 


94,217,039 


4 


Unmapped 


11,823 


55,313,737 


2 


Known 


281 


2,505,844 


0.1 


chromosome 








Unknown 


11,542 


52,807.893 


2 



chromosome 
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not covered by. a matching segment in the •„ . The CSA assembly was a few percentage ... v .....In order to, determine the effectiveness of 
other assembly. Some 82.5 Mbp of the WGA - points better in terms of coverage and slightly the fingerprint maps and GM99 for mapping 
(3.95%) was not covered by the CSA, where- more consistent than the WGA, because it : r scaffolds, we first examined the reliability of 
as 204.5 Mbp (8.26%) of the CSA was not v was in effect performing a few thousand shot- these maps by comparison with large scaf- 
covered by the WGA. This estimate did not gun assemblies of megabase-sized problems, v folds. vOnly 1% of the STS markers on the 10 
require any consistency of the assemblies or whereas the WGA is performing a shotgun v largest scaffolds (those >9 Mbp) were 
any uniqueness : of the matching segments, ./ assembly of a gigabase-sized problem. When k *. mapped./on >; a ; different -chromosome . on 
Thus, another . analysis was conducted in ; - one considers the increase of two-and-a-half . GM99. Two, percent of the STS markers dis- : 
which matches of less than .1 kbp between a : : ■ orders of magnitude in problem size, the m-- r<; agreed in position .by- more man five frame- J 
; pair of scaffolds were Excluded unless they- formation loss between the two is remarkably y.\work -bins/; However, f for : me^= fmgerprint * 
were confirmed by other matches having a > : : . small. Because CSA was logistically easier, to v. maps, a 2% /chromosome: discrepancy -was 
consistent order and orientation. This gives. - deliver and the better of the two results avail-- .observed,-, and on average. 23.8%: of ;B AC 
; some measure of consistent coverage: 1.982 -able at the time .when downstream analyses v- locations in. the scaffold -sequence disagreed 
' :■■ Gbp (95.00%) of the WGA is covered by the • needed to .be begun, all subsequent analysis with fingerprint map placement by more than 
CSA, and 2.169 Gbp (87.69%) of the CSA is . was performed on this assembly. .... five BACs. When further examining the 

covered by me WGA by mis more stringent . . • , . - source of discrepancy, it was found that most 

measure: ; : ; : v ; 2.6 Mapping scaffolds to th^ 10 

The comparison of WGA to CSA also : ;-yr The final s^ep m assemblmg.me genome w^ 
; permitted evaluation of scaffolds for structur- -: order and orient the scaffolds on the chromo- • : ■<>■ the quality of either the map or the scaffolds. 
•I al inconsistencies. We looked for instances in , s somes.- We first grouped scaffolds. together on * .All four.scaffolds were assembled, as well as 
which a large section of a scaffold from one : the basis of their order in the components from the other six, as judged by clone coverage 
. assembly matched only one scaffold from the ; CSA. These grouped scaffolds were reordered - analysis, and showed the same low discrep- 
■Vother assembly, but failed to match over the , ;. by examining residual .mate-pairing . data be- rancy rate to .GM99v and thus we. concluded 
full length of the overlap implied by, the • -tween the scaffolds. We next mapped the scaf-:-*¥that the fingerprint map global order in these 
.. matching segments. An initial set of candi- fold groups onto the chromosome using physi- . : r- cases was not reliable. Smaller scaffolds had 
; dates was identified automatically, and then ..... cal mapping data. This step depends on having ; & a higher discordance rate with GM99 (4.21% 
, each candidate was inspected by hand. From; <• reliable high-resolution map information such u," of STSs were discordant by more, than five 
this process, we identified 31 instances in ; that each scaffold will overlap multiple mark- . framework bins), but a lower discordance ^^^^ 
which the assemblies appear to disagree in a • ers. There are two genome-wide types of map j with the fingerprint maps (1 1% of BACs 
nonlocal fashion. These cases are being fur- . information available: high-density STS maps disagreed with fmgerprint maps by more than 
ther evaluated to determine which assembly . and fingerprint maps of BAC clones developed > /.five BACs). This observation agrees with the 
is in error and why. • at Washington University (45). Among the ge 7 r // clone coverage analysis (46) that Celera scaf- - 

In addition, we evaluated local mcoiisis- • . ' nome-wide; STS maps, GerieMap99 ^ (GM99) , ;: fold construction was / better : supported vby 
tencies of order or orientation. The following / -has the most markers and therefore was most ;.- long-range mate pairs in larger scaffolds than • 
results exclude cases in which one contig in useful for mapping scaffolds. The two different in small scaffolds. 

one assembly corresponds to more than one mapping approaches are complementary to one We created two orderings of Celera scaf- 
overlapping contig in the other assembly (as another. The fingerprint maps should have bet- folds on the basis of the markers (BAC or 
long as the order and orientation of the latter ter local order because they were built by com- STS) on these maps. Where the order of 
agrees .with the-positions they match in the parison of overlapping BAC clones. On the scaffolds- agreed between GM99 and the 
former). Most of these small rearrangements .other hand, GM99 should have a more reliable a .WashU BAC map, we had a high degree of 
involved segments on the order of hundreds long-range order, because the framework mark- . confidence that that order was correct; these 
of base pairs and rarely >1 kbp. We found a ers were derived from well-validated genetic \- - scaffolds were - termed "anchor scaffolds." 
total of 295 kbp (0.012%) in the CSA assem- maps. Both types of maps were used as a Only scaffolds with a low overall discrepancy 
blies that were locally inconsistent with the reference for human curation of the compo- rate with both maps were considered anchor 
WGA assemblies, whereas 2.108 Mbp nents that were the input to the regional assem- scaffolds. Scaffolds in GM99 bins were al- 
m U%-\ in the WfiA assemblv were incon- blv. but they did not determine the order of lowed to permute in their order to match 

WashU ordering, provided they did not vio- 
late their framework orders. Orientation of 
individual scaffolds was* determined by the 
presence of multiple 'mapped markers with . 
consistent order. Scaffolds with only one 
marker have insufficient information to as- 
sign orientation. We found 70.1% of the ge- 
nome in anchored scaffolds, more than 99% 
of which are also oriented (Table 4). Because 
GM99 is of lower resolution than the WashU 
map, a number of scaffolds without STS 
matches could be ordered relative to the an- 
chored scaffolds because they included se- 
quence from the same or adjacent BACs on 
the WashU map. On the other hand, because 
of occasional WashU global ordering dis- 
crepancies, a number of scaffolds determined 
to be ,< unmappable" on the WashU map could 
be ordered relative to the anchored scaffolds 



sistent with the CSA assembly. 



sequences produced by the assembler. 
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Fig. 5. Distribution of scaffold sizes of the CSA For each range of scaffold sizes, the percent of total 
sequence is indicated. 
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properly place a Celera read, so all reads were 
first masked against a library of common 
repetitive elements, and only matches of at 
least 40 bp to unmasked portions of the read 
constituted a hit. Of Celera's 27.27 million 
reads, 20.76 million matched a bactig and 
another 0.62 million reads, , which did not 
have any matches, were nonetheless identi- 
fied as belonging in the region of the bactig's 
BAC because their mate matched the bactig. 
. Of the remaining reads, :2.92 million were 
v; completely screened out and so could not be 
; matched, but the other 2 97 million reads had 
■ unmasked sequence totaling 1.189 Gbp that 
were not found in . the GenBank . data set. 
r Because the Celera data are 5.1 1 X redundant,, 
we estimate that 240 Mbp : of unique Celera 
sequence is not in the GenBank data set. - 

In the next step of the CSA process, a 
combining assembler took the relevant 5X 
Celera reads and bactigs for a BAC entry, and 
produced an assembly of the combined data 
for that locale. These high-quality sequence . 
reconstructions were a transient result whose 
utility was simply to provide more reliable 
information for the purposes of their tiling - 
-■" into -sets of overlapping and adjacent scaffold <; 
sequences in the next step. In. outline, ; the 
combining assembler first examines the set of < 
matching Celera reads to determine if there 
are excessive pileups indicative . of : un- 
screened repetitive elements. Wherever these 
occur, reads in the repeat region whose mates 
. have not been mapped to consistent positions ■•; 
are removed. Then all sets of mate pairs that 
consistently imply the same relative position ; 
of two bactigs are bundled into a' link and - 
weighted according to the number of mates in 
the bundle. A "greedy" strategy then attempts 
to order the bactigs by selecting bundles of 
mate-pairs in order of their weight. A selected 
mate-pair bundle can tie together two forma- 

single scaffoFd only if it is consistent with the" 
majority of links between contigs of the scaf- 
fold. Once scaffolding is complete, gaps are 
filled by the "Stones" strategy described 
above for the WGA assembler. 

The GenBank data for the Phase 1 and 2 
BACs consisted of an average of 19.8 bactigs 
per BAC of average size 8099 bp. Applica- 
tion of the combining assembler resulted in 
individual Celera BAC assemblies being put 
together into an average of 1.83 scaffolds 
(median of 1 scaffold) consisting of an aver- 
age of 8.57 contigs of average size 18,973 bp. 
In addition to defining order and orientation 
of the sequence fragments, there were 57% 
fewer gaps in the combined result. For Phase 
0 data, the average GenBank entry consisted 
of 91.52 reads of average length 784 bp. 
Application of the combining assembler re- 
sulted in an average of 54.8 scaffolds consist- 
ing of an average of 58.1 contigs of average 
size 873 bp. Basically, some small amount of 



.issembly took place, but not enough Celera 
data were matched to truly assemble the 0.5 X 
to IX data set represented by the typical 
Phase . 0 BACs. The . combining assembler 

• was also applied to the Phase 3 BACs .for" 
SNP identification, confirmation of assem- 

: bly, and localization of the Celera reads. The 



C{ ric or contaminating sequence (from 
anomer part of the genome) would not be 
incorporated into the reassembly of the com- 
ponent because it did not belong there. In 
^ effect, the previous steps in the, CSA process 
served only to bring together .Celera frag- 
ments, and PFP. data relevant to. a. large con- 



phase 0 data suggest that a combined whole-> . tiguous segment. of , the. genome^ wherein we 



genome shotgun data , set and IX light-shot- 
gun of BACs will not yield good assembly of 
. - BAC .regions; at .least ,3X light-shotgun of * 
: ..each BAG is needed . . . v ; • : v.. 

The 5.89 million Celera fragments not 
: matching the GenBank data were assembled;.- 
with our :whole-genome assembler. The as- . 
...sembly "resulted. in a set of scaffoldYtotaling 
* ' 442 Mbp in span and consisting of 326 Mbp 
; of seiquence. More than 20% of the scaffolds .:■ 
. were. >5 kbp long, and these averaged 63% ^ 
sequence and 27% gaps with a total of 302 . 
Mbp of sequence. All scaffolds >5 kbp were 
forwarded along with all scaffolds produced 
: by the combining^ assembler, to. the subse- 
quent tiling phase. " 

• * At this stage, we typically had one or two ■ 
; scaffolds for every . BAC region constituting 
r.-.at least .95% of the relevant sequence, and a 
. collection of disjoint Celera-unique scaffolds. . 
The next step in developing the genome com- 
ponents was to determine the order and over-. . 
lap tiling . of these BAC and Celera-unique 
' scaffolds across the. genome. For this, we 
used Celera's 50-kbp mate-pairs information, \ 
, and BAC-end pairs (18) and sequence tagged ... 
^ite (STS) markers (44) to provide long- . . 
:• range guidance arid, chromosome separation. 



Given the relatively manageable number of 
-• scaffolds, we chose not to produce this tiling 
in a fully automated manner, but to compute 
an initial tiling with a good heuristic and then 
use human curators to resolve discrepancies 
or missed join opportunities. To this end, we 
developed a graphical user interface that dis- 
played the graph of tiling overlaps arid the 
evidence for. each.*. A human curator could = 
then explore the implication of mapped STS • 
data, dot-plots of sequence overlap, and a 
visual display of the mate-pair evidence sup- 
porting a given choice. The result of this 
process was a collection of "components," 
where each component was a tiled set of 
BAG and Celera-unique scaffolds that had 
been curator-approved. The process resulted 
in 3845 components with an estimated span 
of 2.922 Gbp. 

In order to generate the final CSA, we 
assembled each component with the WGA 
algorithm. As was done in the WGA process, 
the bactig data were shredded into a synthetic 
2X shotgun data set in order to give the 
assembler the freedom to independently as- 
semble the data. By using faux reads rather 
than bactigs, the assembly algorithm could 
correct errors in the assembly of bactigs and 
remove chimeric content in a PFP data entry. 



* applied the assembler used for .WGA to pro- 
■ duce an ab initio assembly of the region. 
. - > ; v : . WGA. assembly of the components result- ■ 
: ed in a set of scaffolds totaling 2;906 ; Gbp in v . 

span and consisting of 2 654 Gbp of se- 
. quence. The chaff, or set , of reads not incor: v 
porated into the assembly, numbered 6. 17 
million^or .'22%.: More than ;90.0% : of the 
genome was covered by scaffolds spanning 
: >100 kbp long, and these averaged .92.2% .. 
•s sequence and 7.8% gaps with a total of 2.492 , 

Gbp of sequence. There were a total of 
v 105,264.gaps among the 107,199 contigs that 
I belong to the 1940 scaffolds, spanning > 100 - 
.kbp. The average scaffold size was 1.4 Mbp, .V 
the average contig size was. 23.24 kbp, and 
the average gap size was 2.0 kbp where each 
- distribution - of sizes was .exponential. As 
f such, averages tend to ; be,underrepresentative ,. 
; of the majority of the data.. Figure 5 shows a '.; 
- histogram, of the bases in scaffolds of various 
♦size ranges. Consider i also : that more than 
49% of all gaps were' <500 bp long, more 
than 62% of all gaps were <1 kbp, and all 
gaps are < 100 kbp long. Similarly, more than . 
73% of the sequencers in contigs > 3 0 kbp, ■;■ . 
more than 49% is in contigs > 100 kbp, and ■ 
/the largest contig was 1 .99 Mbp long. Table 3 
^provides- summary statistics -for the structure , . 
of this assembly with a direct comparison, to 
the WGA assembly. , 

2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the hu- 
man genome via independent computational 
processes (WGA and. CSA), we compared 
scaffolds from the two assemblies as another 
means of investigating their completeness, 
consistency, and contiguity. From each as- 
sembly, a set of reference scaffolds contain- 
ing at least 1000 fragments (Celera sequenc- . 
ing reads or bactig shreds) was ootained; this 
amounted to 2218 WGA scaffolds and 1717 . 
CSA scaffolds, for a total of 2.087 Gbp and 
2.474 Gbp. The sequence of each reference 
scaffold was compared to the sequence of all 
scaffolds from the other assembly with which 
it shared at least 20 fragments or at least 20% 
of the fragments of the smaller scaffold. For 
each such comparison, all matches of at least 
200 bp with at most 2% mismatch were 
tabulated. 

From this tabulation, we estimated the 
amount of unique sequence in each assembly 
in two ways. The first was to detennine the 
number of bases of each assembly that were 
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rt some 22, all stones were placed conectly..^.'./*,Jiave.xequired a computer with a 600-gigabyte / - tribution of each- was essentially exponential 
• The final method of resolving gaps is to.- ;. : RAM. By making the Overlapper and Unitigger More than 50% of all gaps were less than 50( 
fill them with assembled B AC data that cover . incremental, we were able to achieve the same . ; bp long, >62% of all gaps were less than 1 kbj 
• the gap. We call this external gap "walking." { • * computation with a maximum of instantaneous : long, and no gap was >100 kbp long. Similar 

. We did not include the very aggressive "Peb- ^ usage of 28 gigabytes of RAM. Moreover, the > . ly, more than 65% of the sequence is in contig 
bles" substage described in our Drosophila - incremental nature of the first three stages al- ■ >30 kbp, more", than 31% is in contigs >10C 

; : , work, which made enough mistakes so ;as to £ ^; lowed us to continually update the state of this ' \ kbp, and the largest contig was L22 Mbp long, 
-produce repeat reconstructions for long inter- - ^ as data were delivered • Table 3, gives; detailed .summaiyi statistics fo: 

■ spersed elements whose quality was -.only ,^arod men perform, a;7-day. run Jo'complete Scaf-V : the cstracture ,oi>ihis ;assembly;'>w&;;a; ; "direct. 
: 99.62% correct. We decided mat fbr.the hu-^ de- ; .comparison to :me compartmentalized shotgun 

;man genome it was pklosopm^aUy better not ;vj sired assembly. ; 

•to introduce a step that was certain to produce • • compute infi^stmcture consists of 10 four-pro- 
• , jess . than .99.99% accuracy. The cost was a • cessor SMPs with 4 gigabytes of memory per 2.4 Compartmentalized shotgun 
. ; .somewhat larger number of gaps of some- : .c assembly 

what larger size. : : . processor NUMA.machine with 64 gigabytes i In addition to the WGA approach, we pur- 

At the final stage of the assembly process, ; of memory (Compaq's GS 160, Wildfire). The; sued a localized assembly approach that was 
and also at several intermediate points, .a U:total compute ,for a run of the, assembler was intended to subdi vide the /genome into seg- 

/consensus sequence of every contig is pro-/ ■.; roughly 20,000 CPU hours. r • ments, each of < which could be shotgun as- 

duced. Our algorithm is driven by the princi- * , • The , assembly of Celera's data, together ;sembled individually. We expected that this 
pie of maximum parsimony, with quality-: .c . , with the shredded bactig data, produced a set of ^ would help in resolution of large interchro- 
value-weighted measures for evaluating each scaffolds totaling 2.848 Gbp in span and con- mosomal duplications and improve the statis- 
base. The net effect is a Bayesian estimate of sisting of 2.586 Gbp of sequence. The chaff, or - tics, for calculating U-unitigs. The compart- 
the correct base to report, at each position; ,-'. : <r set of reads not. incorporated m mejassembty^ 

Consensus generation uses Celera dau when-, ■•numbered 11:27 million (26%), which is con- tering Celera reads and bactigs into large, 
ever it is present. In the event that no Celera -sistent with our experience for Drosophila. multiple megabase regions of the genome, 

-data cover a given region, the BAC, data ^More/than 84% of the genome was covered by v : and then running the WGA assembler on the 
sequence is used. . ^'scaffolds ^100. kbp long, and these. averaged Celera v data and shredded, faux reads ob- 

. A key element of achieving a WGA of the 91% sequence and 9% gaps with a total of tained from the bactig data, 
human genome was to parallelize the Overlap- 2.297 Gbp of sequence. There were a total of : r The first phase of me ,CS A strategy was to 
per and the central consensus sequencercon- ; ' ■■ 93,857 gaps among the 1637, scaffolds >100 - separate Celera reads, into Aose that matched 

J structing subroutines. In addition, memory was J Jcbp. The. average scaffold size was 1.5 Mbp, . the BAC contigs jfbr ;a particular ;PFP;BAC 
a . real issue — a straightforward application of arid me . en^ 

the software we had built for Drosophila would ^average gap size was 2.43 kbp, where the dis- , , data. ;Such matches; musti.be. guaranteed to 
Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies. 









Scaffold size 






AH 


>30 kbp 


>100kbp 


. .>50bkbp" ~ 


■* >1000 kbp 






Compartmentalized shotgun assembly 






No. of bp in scaffolds 


2,905,568,203 


2,748.892,430 


'2,700,489,906 


2.489.357.260 


2,248.689,128 


(including intrascaffotd gaps) 












No. of bp in contigs 


2,653,979.733 


2,524,251,302 


2,491,538,372 


2,320,648.201 


2,106,521,902 


No. of scaffolds 


53,591 


2,845 


1.935 


1.060 


721 


No. of contigs 


170,033 


112,207 


107.199 


93.138 


82.009 


No. of gaps 


116,442 


109,362 


105,264 


92,078 


81.288 


No. of gaps si kbp 


72,091 


69.175 


67.289 


59.915 


53.354 


Average scaffold size (bp) 


54,217 


966,219 


1.395.602 


2,348,450/ 


3.118,848 


Average contig size (bp) 


15,609 


22,496 


23,242 


24,916 


25.686 


Average intrascaffold gap size 


2,161 


2,054 


1.985 


1.832 


1,749 


(bp) 












Largest contig (bp) 


1.988,321 


1,988.321 


1.988.321 


1,988.321 


1,988.321 


% of total contigs 


100 


95 


94 


87 


79 






V/hole-genome assembly 








No. of bp in scaffolds 


2,847,890,390 


2,574,792.618 


2,525,334.447 


2.328.535.466 . 


2.140.943,032 


(including intrascaffold gaps) 










No. of bp in contigs 


2.586,634,108 


2.334,343,339 


2,297.678.935 


. 2.143.002.184 


1,983.305,432 


No. of scaffolds 


■ ". ' 118,968 


2,507 


1.637 


818 


554 


No. of contigs 


221.036 


99,189 


95.494 


84.641 


76.285 


No. of gaps 


102.068 


- 96,682 


93,857 


* 83.823 


75.731 


No. of gaps si kbp 


62.356 


60,343 


59.156 


54.079 


49,592 


Average scaffold size (bp) 


23.938 


1.027,041 


1.542.660 


2.846.620 


3.864,518 


Average contig size (bp) 


11.702 


23,534 


24.061 


25.319 


25.999 


Average intrascaffold gap size 


2.560 


2,487 


2,426 


2.213 


2,082 


(bp) 

Largest contig (bp) 


1,224.073 


1.224,073 


1.224.073 


1.224.073 


1.224.073 


% of total contigs 


100 


90 


89 


83 


77 
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The Overlapper compares every L i 
against every other read in search of complete 
end-to-end overlaps of at least 40 bp and with 
no more than 6% differences in the match. 
Because all data are scrupulously vector- 
trimmed, the Overlapper can insist on com- 
plete overlap matches. Computing the set of 
all overlaps took roughly. 10,000 CPU hours 
with a suite of four-processor Alpha SMPs 
with 4 gigabytes of RAM. This, took 4 to 5 
days in elapsed time with 40 such machines '.. 
operating in parallel 

. Every overlap computed above is statisti- 
cally^ l-in-10 17 event and thus not a coinci- 
dental event. What makes assembly combi- 
.natorially. difficult is .that while many, over- ; 
laps , are actually sampled from overlapping , 
regions of the genome, and thus imply, that 
the sequence reads should be assembled to- 
. .gether,.even more overlaps are. actually from .i 
two distinct copies of a low-copy repeated 
element not screened above, thus constituting 
an error if put together. We call the former. • 
"true overlaps", and the latter "repeat-induced 
overlaps." The assembler .must avoid choos-,. 
ing repeat-induced overlaps, especially early -\ 
in the process. 

We achieve this objective in the Unitig- . 
ger. We first find all assemblies of reads that 
appear to. be uncontested with respect to all • 
other reads. We call the contigs formed from 
these subassemblies unitigs (for uniquely as- 
sembled contigs). Formally, these unitigs are . 
the uncontested interval subgraphs of -the . 
graph of all overlaps (42). Unfortunately, al- 
though empirically many of these assemblies 
are correct (and thus involve only true over- 
laps), some are in fact collections of reads 
from several copies of a repetitive element 
that have been overcollapsed into a single 
subassembly. However, the overcollapsed 
unitigs are easily identified because their av- 
erage coverage 3epth is loVhlgh to 'be con- 
sistent with the overall level of sequence 
coverage. We developed a simple statistical 
discriminator that gives the logarithm of the 
odds ratio that a unitig is composed of unique 
DNA or of a repeat consisting of two or more 
copies. The discriminator, set to a sufficiently 
stringent threshold, identifies a subset of the 
unitigs that we are certain are correct. In 
addition, a second, less stringent threshold 
identifies a subset of remaining unitigs very 
likely to be correctly assembled, of which we 
select those that will consistently scaffold 
(see below), and thus are again almost certain 
to be correct. We call the union of these two 
sets U-unitigs. Empirically, we found from a 
6X simulated shotgun of human chromosome 
22 that we get U-unitigs covering 98% of the 
stretches of unique DNA that are >2 kbp 
long. We are further able to identify the 
boundary of the start of a repetitive element 
at the ends of a U-unitig and leverage this so 
that U-unitigs span more than 93% of all 
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singly interspersed Alu elements and other 
100-to 400-bp repetitive segments. 

The result of running the Unitigger .was 
thus a set of correctly assembled subcontigs. 
covering an estimated 73.6% of the human 
genome. . The Scaffolder then proceeded to^ 
use mate-pair information to. link these to- 
gether into scaffolds. When there are two or'' 
more mate pairs that imply that a given pair 
.of U-unitigs , A are;. at a certain ^.distance and..-; 
orientation > with respect to each other, the., 
probability ■* of this being wrong is again 
. roughly ..1 in , 10 1 ?, assuming that mate pairs -J 
. . are . false less than 2% of the time. Thus, one £ 
■can : ^vith:high -confidence link together all ; 
U-iuiitigs that are linked by at least two 2- or 
10-kbp. mate .pairs producing intermediate- v 
^sized . .scaffolds that are . then . recursively : : 
. linked -together by , confirming , 50-kbp mate '■. 
pairs, and BAG. end, sequences. -This process . 
yielded scaffolds that are on the order of $ 
megabase pairs in size with gaps between 
..their contigs that generally correspond to re- 
petitiye elements and occasionally to small ., 
.-sequencing gaps. These scaffolds reconstruct r 
the majority of the unique sequence .within a )■ 
genome. 

For the Drosophila assembly, we engaged . 
in ; a three-stage repeat .resolution strategy v> 
where • each ; . stage .was . progressively .more .: f; 



5.11XCelera Reads 
39X mate pairs 



aggressive and thus more likely to make a 
mistake. For the human assembly, we contin- 
ued to use the first "Rocks" substage where 
' • all unitigs with a good, but not definitive, 
discriminator score are placed in a scaffold 
gap/This was done -with the condition that 
: ':. two or more mate pairs with one of their 
'breads' already in the scaffold unambiguously 
-/.place the unitig in the given gap. .We estimate 
Avthe probability of inserting a unitig into an 
mcorrect gap with this strategy to be less than 
• } 1 0~? based on a probabilistic analysis. ; 

• ; We revised the ensuing "Stones'?, substage 
V of the human assembly, making it more like 
. v.the mechanism suggested in bur earlier work 
;<; (43); For each gap, every read R that is placed 

• :-:in the gap by virtue of its mated pair M being 

• in a contig of the scaffold and. implying R's 
/• placement is collected. Celera's mate-pairing, 
.c information is correct more than 99% of the ': 

•time. Thus, ^almost every, but not all, of the * 
breads in the set belong in the gap, and when - 
..--.a read does not belong it rarely agrees with 
the remainder of the. reads.- Therefore, we 
,S simply assemble this set of reads within the 
j gap, : -eliminating any reads that conflict with 
the assembly. This operation proved much 
more reliable than the one it replaced for the 
\Drosophila assembly; in the. assembly of a . 
^simulated shotgun data set of human chromo- ^ 



Public Bactiqs 
(from 33.421 BACs) 




Bactigs & Cetera pairs 
V (binned by B AC) 




Components 1 




Components 2 




^ Components^ 





WGA Assembly CSA Assembly 

Fig. 4. Architecture of Celera's two-pronged assembly strategy. Each oval denotes a computation 
process performing the function indicated by its label, with the labels on arcs between ovals 
describing the nature of the objects produced and/or consumed by a process This figure 
summarizes the discussion In the text that defines the terms and phrases used 
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information was ignored because some BAG*, at least 22*** ^SSfflo 
, map and because we found strong evidence that > . possibly as a resuii 01 *u ? 
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Uvermore National Laboratory Cold Spring " a ^ r , U ^"^?n Sanford University: The Institute for Genomic 
Molekulare. Genetik: Japan Science and S^°|?prnvf UniveRity ofOHahoma; University of Texas 

Research; The Institute of Physlca and Chemical I eseard, ty ^ ^ ^ ^ 

Southwestern Medical Center. University of Washmgton. (The w>./ w 
shredded Into faux reads resulting In 2S6X coverage of the genome. 



•; :(see below). In short,- we performed a tmc, ab 
: 'initio whole-genome assembly in which wt 
- took me expedient of deriving additional x- 

quence coverage, but not mate pairs, assembled ; 

bactigs,or genome locality, from some cxtcr- 

• nally generated data. 

v > > • In the compartmentalized, shotgun assembly 
* : (CSA), Celera and PFP data were partitioned, 
v into the largest possible chromosomal segments 
or 'Components" mat could be determined with 
confidence, and then shotgun assembly was ap- 
• ; > plied to : each partitioned ..subset . wherein fhc 
••- . bactig data were again shredded into faux reads 
- v to ensure an independent ab initio assembly of 
..(he component By subsetting the data in this 

• way, (he overall computational effort was rc- 
> - duced and (he effect of interchrofnosomal dupli- 
■* cations was ameliorated. This also resulted in a 

reconstruction of the genome that was relatively 
t independent of the whole-genome assembly re- 
sults so that the two assemblies could be com- 
pared for consistency. The quality of the parti- 
v tioning -into : components .was ..crucial so ;thal 
different genome regions were not mixed to- 
aether We constructed components from (0 Uic 
•.loneest - scaffolds of the sequence from each 
BAC and (ii) assembled scaffolds of data un.quc 
: (o Celera's data set The BAC assemblies were 
v obtained by a combining assembler that used the 
; bactigs arid (he 5X . Celera data mapped to (hose 
) ,bactigs as mput'TMs effort was under^^^ <« 
l-:arimterim ; step solely because ^me'more accuru c . 
feand complete thescaffoldfora gtven sequence 
■ stretch/the more accurately one can tde these 
scaffolds into contiguous components on m. 
basis of sequence overlap and mate-pa.r into.- 
mation. We further visually inspected and cu- 
id (he scaffold tiling of the cor— «o 
further increase its accuracy. For the final CSA 
assembly, all but the partitiomng was^ ignored 
, and an independent, ab inrao reconstruct onoi 
, Z sequencS in each component was obtmnd 
hv aoDlving our whole-genome assembly nifc° 

Aeshredded faux reads of the partriioncd. ul 
evant bactig data. 

2 3 Whole-genome assembly 

The algorithms used for whole-genome «*- 
sVmblyW) of the 
enhancements to those used to PJ**, 
sequence of the DrosopMa genome rcpon 
in detail in (28). . , ilK 

The WGA assembler consists of a _P"P 
composed of five principal steges: Sera 

Overlapper, ^^ { ^%^U 
Resolver, respectively. The Screcntr 
and marks all microsatellrte repeats wrth 
(han a 6-bp element and screens ou 
known interspersed repeat e ejents.^ . 
ing Alu, Line, and nbosomal DMA. v* 
regions get searched for overlaps 
screened regions do not get searched but ^ 
be part of an overlap that mvolves unscrc 
matching segments. 
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and provide a comparison to the public gen' 
sequence, which was reconstructed largely v 
an independent BAC-by-BAC approach. Our 
assemblies effectively covered the euchromatic 
regions of the human chromosomes. More than 
90% of the genome was in scaffold assemblies 
of 100,000 bp or greater, and 25% of the ge- 
nome was in scaffolds of 10 million bp or 
larger. 

Shotgun sequence assembly is a classic 

• example of an inverse problem: given a set 
of reads randomly sampled from a target 
sequence, reconstruct the order and the po- 
sition of those reads in the target. Genome 

' assembly algorithms :developed for Pro- 
sophila h&ve now been extended to assemble- 

, the —25-fold larger human genome. Celera as- 
semblies consist of a set of contigs that are 

• ordered and oriented into scaffolds that are then 
mapped to chromosomal locations by . using 

: known markers. The contigs consist of a col-, 
lection of overlapping sequence reads that pro- 
vide a consensus reconstruction for a contigu- 
ous interval of the genome. Mate pairs are a 
-central component of the assembly strategy. 
They are used to produce scaffolds in which the 
size of gaps between consecutive contigs is 
known with reasonable precision. This is ac- 
complished by observing that a pair of reads, 
"one of which is in one contig, and the other of 
which is in another, implies an orientation and 
distance between the two contigs (Fig. 3). Fi- 
nally, our assemblies did not incorporate all 
reads into the final set of reported scaffolds. 
This set of . unincorporated reads is termed 
"chaff," and typically consisted of reads from 
within highly repetitive regions, data from other 
organisms introduced through various routes as 
found in many genome projects, and data of 
poor quality or with untrimmed vector. 
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2.1 Assembly data sets 

We used two independent sets of data for our 
assemblies. The first was a random shotgun 
data set of 27.27 million reads of average length 
543 bp produced at Celera.- This consisted 
largely of mate-pair reads from 16 libraries 
constructed from DNA samples taken from five 

. different donors. Libraries with insert sizes of 2, 
10, and 50 kbp were used By looking at how- 
mate pairs from a library were positioned in 

. known sequenced stretches of the genome, we 
were able to characterize the range of insert 
sizes, m each library and determine a mean and 

/ standard deviation; Table 1 details the number 
. of reads, sequencing coverage, and clone cov- 
erage achieved by the data set The clone cov- 

. V.erage is the coverage of the genome in cloned 
. DNA,. considering me entire insert of each 

: - clone that has sequence from both ends: The 

. clone j coverage provides a measure of the 
amount of physical DNA coverage of the ge- 

.v; nome! Assuming a genome size of 2.9 Gbp, the 
Celera trimmed sequences gave a 51 X cover-, 
age of the genome, and clone coverage was 
3.42X, 16.40X, and 18.84X forthe 2-, 10-, and 
50-kbp- libraries, respectively, for a total of 
38.7 X clone coverage. 

- : .. The second data set jvas from the publicly 
funded Human Genome Project (PFP) and is 
primarily derived from BAC clones (30). The 

. = BAC data input to the assemblies came from a 
download of GenBank on 1 . September 2000 
(Table 2) totaling 4443.3 Mbp of sequence. 
.The data for each BAC is deposited at one of 

: four levels of completion. Phase 0 data are a set 
of .' generally ~ unassembled sequencing ! reads 1 

.".from a very light shotgun of the BAC, typically 

. less than IX. Phase 1 data are unordered as- 
semblies of contigs, which we call BAC contigs 
or bactigs. Phase 2 data are ordered assemblies 
of bactigs. Phase 3 data are complete BAC 

STS 
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Gap (mean & std dev. Known) 
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— BAC Fragments 

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and 
internally derived reads from five different individuals (black lines) are combined to produce a 
contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using 
mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star) 
physical map information. 



( niences. In the past 2 years the PFP has 
aocused on a product of lower quality and com- 
pleteness, but on a faster time-course, by con- 
centrating on the production of Phase 1 data 
* ■ .from a 3X to 4X. light-shotgun of each BAC 
clone. 

. > We screened the bactig sequences for con- 
v- . taminants by . using the - BLAST algorithm 
y against three data sets: (i) vector sequences 
in Univec core (38), filtered for a 25-bp 
x /; match at 9 8% sequence identity at the ends . 
of the sequence and a 30-bp match ' 'internal V 

- to the sequencer (ii) the :nonhuman portion 
, rof the High -Throughput Genomic i(HTG) 

- v Seqences division of /GenBank : (39), ill-,. . 
v tered at 200 bp at 98%; and (iii) the non- 

^ : - redundant nucleotide sequences from Gen- 
:> Bank without primate and human virus en-. 
. tries, filtered at 200 bp at 98%. Whenever 

■ 25 bp or more of vector was found within 

■ 50 bp of the end of a contig, the tip up to :. 
7 the matching vector was excised. Under.. 
. these: criteria we removed 2.6 Mbp of pos- . 

sible contaminant and vector ; from the 
; Phase 3 data, 61.0 Mbp from the Phase 1 
: and 2 data, and 16.1 Mbp from the Phase 0 
. data (Table 2). This left us with a total of 
4363.7 Mbp of PFP sequence data 20%. 
finished, 75% rough-draft (Phase 1 and 2), 
. and 5% single sequencing reads (Phase 0). 
; An additional 1 04,0 1 8 B AC . end-sequence : 
mate pairs were also downloaded and in- 
. eluded in the data sets for both assembly . 
processes (18). ; . . . - 

2.2 Assembly strategies 

v Two different approaches to assembly were ~ 
. pursued. The first was a whole-genome as- 
sembly process that used Celera data and the 
PFP data in the form of additional synthetic 
shotgun data, and the second was a compart- 
mentalized assembly process that first parti- 
tioned the Celera and PFP data into sets 
localized to large chromosomal segments and 
then performed ab initio shotgun assembly on 
each set. Figure 4 gives a schematic of the 
overall process flow. 

For the whole-genome assembly, the PFP 
data was first disassembled or "shredded" into a 
synthetic shotgun data set of 550-bp reads that 
form a perfect 2X covering o£the bactigs. This 
resulted in 16.05 million "faux" reads that were 
sufficient to cover the genome 2.96X because 
of redundancy in the BAC data set, without 
incorporating the biases inherent in the PFP 
assembly process. The combined data set of 
43.32 million reads (8X), and all associated 
mate-pair information, were then subjected to 
our whole-genome assembly algorithm to pro- 
duce a reconstruction of the genome. Neither 
the location of a BAC in the genome nor its 
assembly of bactigs was used in this process. 
Bactigs were shredded into reads because we 
found strong evidence that 2. 1 3% of them were 
misassembled (40). Furthermore, BAC location 
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nome, and even a modest error rate can 
reduce the effectiveness of assembly. In 
addition, maintaining the validity of mate- 
pair information is absolutely critical for. 
the algorithms described below. Procedural ; 
controls were established for maintaining 

•the validity of sequence mate-pairs as se- 
quencing reactions proceeded through thev 

• process, including strict rules built into the < 

ILIMS. The accuracy of sequence data pro- 
duced by . the Celera process was validated. 

; in the course of the Drosophila genome 
project (26). By collecting data for 1 the 
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entire human genome in a single facility,' 
we were able to ensure uniform quality 
r standards and the cost advantages associat- 
ed with automation, an economy of scale, , 
and process consistency. 

2 Genome Assembly Strategy and : .■> 
Characterization 

.' Summary. We describe in this section the . two 
; approaches that we used to assemble /the ge- 
v nome. One method involves the computational W 
combination of all sequence reads with shred- , 
ded data from GenBank to generate an indepen- * 



-dent, nonbiased view of the genome. The sec- 
ond approach involves clustering all of the fiag- 
; ments to a region or chromosome on the basis 
, of -mapping information. The clustered data 
■ . were then shredded and subjected to computa- 
tional assembly/ Both approaches' provided es- 
sentially the same reconstruction of : assembled 
^. DNA sequence wim proper order and orienta- 
r,:tion.>:The} second 'method -provided .'slightly 
greater sequence coverage ^(fewer gaps) and 
; ;,was the principal sequence used for the analysis 
vphase. ; Jn addition, we . document the complete- 
ness and correctness of , this 'assembly process 
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Fig. 2. Flow diagram for sequencing pipeline. Samples are received, 
selected, and processed in compliance with standard operating proce- 
dures, with a focus on quality within and across departments. Each 
process has defined inputs and outputs with the capability to exchange 



samples and data with both internal and external entities according to 
defined quality guidelines. Manufacturing pipeline processes, products, 
quality control measures, and responsible parties are indicated and are 
described further in the text. 
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collected, as well as five specimens of ser' 
collected over a 6-week period. Permaiivut 
lymphoblastoid cell lines were created by 
Epstein-Barr virus immortalization. DNA 
from five subjects was selected for genomic 
DNA sequencing: two males and three fe- 
males — one African-American, one Asian- 
Chinese, one Hispanic-Mexican, and two 
Caucasians (see Web fig. 2 on Science Online 
at www.sciencemag.org/cgi/content/291/5507/ 
'* 1304/DC1),' The decision of whose; DNA to 
•; sequence was based on a complex mix of fac-: 
'■ tors, including the goal of achieving diversity as 
well as technical issues such as the quality of 
. - the DNA libraries and availability of immortal- " 
ized cell lines. p < • 

1.1 Library construction and 
sequencing . . ■ , . 

.'■ Central to the whole-genome shotgun sequenc- 
: ing process is preparation of high-quality plas- . .. 
: mid libraries in a variety of insert sizes so that '. 
pairs of sequence reads (mates) are obtained, 
one read from both ends of each plasmid insert. 
High-quality libraries have an equal representa- 
; tion of all parts of the genome, a small number 
:.' of clones without inserts, and no contamination 
-'« from such sources as the mitochondrial genome ! 
and Escherichia coli genomic DNA. DNA from * 
each donor was used to construct plasmid librar- 
ies in one or more of three size classes: 2 kbp, 10 
kbp, and 50 kbp (Table 1) (35). 

.In designing the DNA-sequencing pro- 
cess, we focused on developing a simple 
system that could be implemented in a robust * 
and reproducible-manner and monitored ef-: 
fectively (Fig. 2) (54). \ 
: '" K Current sequencing protocols are based on 



the dideoxy sequencing method (55), which 
typically yields only 500 to 750 bp of sequence 
per reaction. This limitation on read length has 
made monumental gains in throughput a pre- 
requisite for the analysis of large eukaryotic . 
genomes. We accomplished this at the Celera 
facility, which occupies about 30,000 square 
feet of laboratory space and produces sequence \ 
data continuously at a rate of 175,000 total , 
reads per day. The DNA-sequencing facility is 
v .^supported by a high-performance computation-:: 
al facility (36) ; ••" v*. 

,r The process for DNA sequencing was mod- 
: .v ular by design and automated. Intermodule 
';• -sample -backlogs -allowed four principal ■ 
; : ^modules to ^operate independently: • (i) • li-v> 
L ibrary, transformation,'. plating;: and colony . 
-picking; (ii) * DNA template preparation; 
(iii) dideoxy .,' sequencing reaction set-up * 
■ and purification; and : (iv) sequence deter- > 
: mination with the ABI PRISM 3700 DNA 
.-• Analyzer; Because the Inputs and outputs 

• of each module have been carefully V 
. • matched and sample backlogs are continu- ... 

ously managed, sequencing has proceeded. v 
' without a single day's interruption since the 
. initiation of the Drosophila project in May 
Vl999/The"ABI 3700 is a fully automated - ; 
.. capillary array sequencer and as such can > 
be operated ? with a minimal amount off 
^ hands-on time^ currently estimated at about r 
15 min per day. The capillary system also 

• facilitates correct, associations of sequenc- 
ing traces with samples through the elimi- 
nation of manual sample loading and lane- 

.: tracking errors associated with slab gels: 
- About 65 production staff were hired and 
.' trained, and were rotated on a regular basis >- 



rough the four production modules. A 
central laboratory information management 
system (LIMS) tracked all sample plates by 
. unique bar code identifiers. The facility was 
\ v supported by a quality. control team thai per- 
; formed raw material and in-process testing 
: - and a quality assurance group with responsi- 
bilities including document control, valida- 
tion, and auditing.of the facility. Critical to 
- the success of the scale-up was the validation 
ft of ' all software and .instrumentation, before 
implementation, and production-scale testing 
.of any process changes ~ ,.. 

1.2 Trace processing 

? An automated trace-processing pipeline - has 
*'been developed to process each sequence file 
K(37). After quality and vector-trirnming, the . 
average trimmed -sequence length was 543 
bp,: and the . sequencing , accuracy . was expo- 
nentially distributed with a mean of 99.5% 
" and with less than ;1 in i 000 reads being less 
than 98% accurate (2d)rEach trimmed se- 
quence was screened for matches to contam- 
: inants including sequences of vector alone, E. 
co/rgenomic DNA; and human mitochondri- . 
•* al DNA. .The entire .read. for . any sequence 
- with a significant match to a contaminant was ;-■ 
. discarded. A total of 713 reads matched E , 
• coli genomic DNA and 2114 reads matched . 
. the human mitochondrial genome. , . ... 

1.3 Quality assessment and, control 

The importance of the base-pair, level ;ac- [ . 
•curacy of the sequence data increases as the 
size and repetitive nature of the genome to 
■be sequenced - increases. /} Each - sequence - 
read must be placed uniquely in the ge r 



Table 1. Cetera-generated data input into assembly. 





Individual 




Number of reads for different insert libraries 




Total number of 




2 kbp 


10 kbp 


50 kbp 


Total 


base pairs - 


No. of sequencing reads 


A 


0 


0 


2.767,357 


2.767357 • 


1.502,674,851 




B 


11,736,757 


7.467,755 


66,930 


19.271,442 


10,464,393,006 




C 


853,819 


881,290 


0 


1,735.109 


942,164,187 




D 


952,523 


1.046,815 


0 


1,999,338 


1,085,640,534 




F 


0 


1,498,607 


0 


1.498,607 


813,743,601 




Total 


13.543,099 


10.894.467 


2,834,287 


27.271,853 


14,808,616,179 


Fold sequence coverage 


A 


0 


0 


0.52 


0.52 




(2.9-Cb genome) 


B 


2.20 


. 1.40 


0.01 


. 3.61 






C 


0.16 


1.17 


0 


0.32 






D 


0.18 


0.20 


0 


0.37 






F 


0 


. 0.28 


0 


0.28 






Total 


2.54 


2.04 


0.53 


5.11 




Fold clone coverage 


A 


0 


0 


18.39 


18.39 






B 


2.96 


11.26 


0.44 


14.67 






C 


0.22 


1.33 


0 


1.54 






D 


0.24 


1.58 


0 


1.82 






F 


0 


2.26 


0 


2.26 






Total 


3.42 


16.43 


18.84 


38.68 




Insert size* (mean) 


Average 


1,951 bp 


10.800 bp 


50,715 bp 






Insert size* (SD) 


Average 


6.10% 


8.10% 


14.90% 






% Matesf 


Average 


74.50 


80.80 


75.60 







•Insert size and SO are calculated from assembly of mates on contigs. f% Mates is based on laboratory tracking of sequencing runs. 
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t^ously map and sequence the human ge- 
ome by means of end sequences from 150- 
kbp bacterial artificial chromosomes (BACs) 
(17 \ 18). The end sequences spanned by 
known distances provide long-range continu- 
ity across the genome. A modification of the . 
: BAC end-sequencing (BES) method was ap- 
v. plied successfully to complete chromosome 2 : 
: from the Arabidopsis thaliana genome {19). 
In 1 997, Weber and Myers (20) proposed 
whole-genome shotgun ; sequencing of the 
r: human genome. Their proposal was not well 
:; received (27). However, by early 1998, as 
less than 5% of the genome had been se- 
; quenced, it was clear, that the rate of progress 
/ in 'human genome sequencing -worldwide 
was very slow (22), and the prospects for 
finishing the genome by the 2005 goal were 
uncertain. 

In early 1998, PE Biosystems (now Applied 
Biosystems) developed an automated, - high- : 
throughput capillary DNA . sequencer, -subse- 
quently called the ABI PRISM 3700 DNA' 
. Analyzer. Discussions between PE Biosystems 
. and TIGR scientists resulted in a plan to under- 
. take the sequencing of the human genome with 
y. '; the 3700 DNA Analyzer and the whole-genome 
shotgun sequencing techniques developed at 
|TIGR (23). Many of the principles of operation 
T^f a genome-sequencing facility were . estab- 
lished in the TIGR facility (24). However, the 
facility envisioned for Celera would have a 
capacity roughly 50 times that of TIGR, and 
thus new developments were required for sam- 
ple preparation and tracking and for whole- 
genome assembly. Some argued that the re- 
quired 150-fold scale-up from the H. influenzae 
genome to the human genome with its complex 
repeat sequences was not feasible (25). The 
Drosophila melanogaster genome was thus 
chosen as a test case for whole-genome assem- 
bly on a large and complex eukaryotic genome. 
In collaboration with Gerald Rubin and the 
Berkeley Drosophila Genome Project, the nu- 
cleotide sequence of the 120-Mbp euchromatic 
portion of the Drosophila genome was deter- 
mined over a 1-year period (26-28). The Dro- 
sophila genome-sequencing effort resulted in 
two key findings: (i) that the assembly algo- 
rithms could generate chromosome assemblies 
with highly accurate order and orientation with 
substantially less than 10-fold coverage, and (ii) 
that undertaking multiple interim assemblies in 
place of one comprehensive final assembly was 
not of value. 

These findings, together with the dramatic 
changes in the public genome effort subsequent 
^to the formation of Celera (29), led to a modi- 
fied whole-genome shotgun sequencing ap- 
proach to the human genome. We initially pro- 
posed to do 10-fold sequence coverage of the 
genome over a 3-year period and to make in- 
terim assembled sequence data available quar- 
terly. The modifications included a plan to per- 
form random shotgun sequencing to —5-fold 
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coverage and to use the unordered and unori- 
ented BAC sequence fragments and subassem- 
blies published in GenBank by . the publicly 
funded genome effort (30) to accelerate the , 
project We also abandoned the quarterly an- ; 

, nouncements in the absence of interim assem- V; 

rblies to report. 

*i w :Although:this strategy provided a reason-; 
. vable result very early that was consistent with a 
whole-genome.; shotgun assembly with eight- 
V; fold coverage, the human genome sequence is > 
S not as finished as the Drosophila genome was 
with an effective 13-fold coverage. However, it 
became clear that even with this reduced cov- 
erage strategy, Celera could generate an accu- 
• > rately ordered and oriented scaffold sequence of 
: ithe human genome in less than 1 year. Human 
- genome sequencing was initiated 8 September 
.1999 and completed 17. June 2000. The first 
assembly was completed 25 June 2000, and the 
... assembly reported here was completed 1 Octo-- 
: ber 2000. Here we describe the whole-genome 
. random shotgun sequencing effort applied to 
the human genome. We developed two differ- 
. ent assembly approaches for assembling the ~3 
billion bp that make up the 23 pairs of chromo- 
somes of the Homo sapiens genome. Any Gen- 
> Bank-derived data were shredded to remove 
potential bias to the final sequence from chi-: 
: meric clones, 1 foreign DNA contamination, or 
;misassemb!ed contigs. Insofar as a correctly * 
and accurately ! assembled genome sequence 
with faithful order and orientation of contigs 
is essential for an accurate analysis of the 
human genetic code, we have devoted a con- 
siderable portion of this manuscript to the 
documentation of the quality of our recon- 
struction of the genome. We also describe.our 
preliminary analysis of the human genetic 
code on the basis of computational methods. : 
Figure 1 (see fold-out chart associated with 
this issue; files for each chromosome can be 
found in Web fig. 1 on Science Online at 
www.sciencemag.org/cgi/content/full/291/ 
5507/1304/DC1) provides a graphical over- 
view of the genome and the features encoded 
in it. The detailed manual curation and inter- 
pretation of the genome are just beginning. 

To aid the reader in locating specific an- 
alytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section. 

1. Sources of DNA and Sequencing Methods 

2 Genome Assembly Strategy and 
Characterization 

3 Gene Prediction and Annotation 

4 Genome Structure 

5 Genome Evolution 

6 A Genome-Wide Examination of 
Sequence Variations 

7 An Overview of the Predicted Protein- 
Coding Genes in the Human Genome 

8 Conclusions 



1 Sources of DNA and Sequencing 
Methods 8 

Summary. This section discusses the rationale 
.and ethical rules governing donor selection to 

> ensure ethnic and gender diversity along witf, 
; : ;,the methodologies for.DNA extraction and 

: •brary .vcohstruction. The plasmid library con- 
y t struction sis the :first critical step in shotgun v 
sequencing^ If the DNA libraries are not unj. ■'* 

> , form in size, nonchimeric, and do not randomly ' 
^represent the genome, then the subsequent steps ' 

.. cannot accurately reconstruct .the genome se- 
quence. We used automated high-throughput 
. DNA sequencing and the computational infra- 
^stmcture. to; enable efficient; tracking of cnor*\ 
.> mous amounts of sequence information (27.3 ■ 
: :. million sequence, reads; 14.9 billion bp of sc- ""' 
quence). Sequencing and tracking from both * 
. . ends of plasmid clones from 2-, 10-; and 50-kbp 
- libraries were -.essential to the computational 
. reconstruction of the genome. Our evidence 
a indicates that the accurate pairing , rate of end : 
. sequences was greater than 98%. : 

Various policies of the United States and the 
. v World Medical Association; specifically the 
Declaration of Helsinki, offer recommenda- 
• tions for conducting experiments with human 
. subjects.:, We convened. an Institutional Re-. 
; wiew. Board. QBE) (31) that helped us estab- 
r lish the protocol for obtaining and using hu- 
man DNA and me informed consent process 
. used to enroll research volunteers for the 
DNA-sequencing studies reported here. We 
adopted several steps and procedures to pro- 
tect the privacy rights and confidentiality of 
the research subjects (donors). These includ- 
■ . ed a two-stage consent process, a secure ran- 
dom alphanumeric coding system for speci- 
. mens and records, circumscribed contact with 
the subjects by . researchers, and options for 
off-site contact of donors. In addition, Celera 
applied for and received a Certificate of Con- 
fidentiality from the Department of Health 
and Human Services. This Certificate autho- 
rized Celera to protect the privacy of the 
individuals who volunteered to be donors as 
provided in Section 301(d) of the Public 
Health Service Act 42 U.S.C. 241(d). 

Celera and the IRB believed that the ini- 
tial version of a completed human genome 
should be a composite derived from multiple 
donors of diverse ethnic backgrounds Pro- 
spective donors were asked, on a voluntary 
basis, to self-designate an ethnogeographic 
category (e.g., African- American, Chinese, 
Hispanic, Caucasian, etc.). We enrolled 21 
donors (32). 

Three basic items of information from 
each donor were recorded and linked by con- 
fidential code to the donated sample: age, 
sex, and self-designated ethnogeographic 
group. From females, -130 ml of whole, 
heparinized blood was collected. From males, 
- 130 ml of whole, heparinized blood was 
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A 2.91 -billion base pair (bp) consi .. J sequence of the euchromatic portion of 
the human genome was generated by the whole-genome shotgun sequencing 
method. The 14.8-billion bp DNA sequence was generated over 9 months from 
27,271,853 high-quality sequence reads (5.1 1-fold coverage of the genome) 
from both ends of plasmid clones made from the DNA of five individuals. Two 
assembly strategies — a whole-genome assembly and a regional chromosome 
assembly— were used, each combining sequence data from Celera and the 
publicly funded genome effort. The public data were shredded into 550-bp 
segments to create a 2.9-fold coverage of those genome regions that had been 
sequenced, without including biases inherent in the cloning and assembly 
procedure used by the publicly funded group^ This brought the effective cov- 
erage in the assemblies to eightfold, reducing the number and size of gaps in 
the final assembly over what would be obtained with 5.1 1-fold coverage. The 
two assembly strategies yielded very similar results that largely agree with 
independent mapping data. The assemblies effectively cover the euchromatic 
regions of the human chromosomes. More than 90% of the genome is in • 
scaffold assemblies of 100,000 bp or more/and 25% of the genome is in 
scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 
26,588 protein-encoding transcripts for which there was strong corroborating 
evidence and an additional -12,000 computationally derived genes with mouse 
matches or other weak supporting evidence. Although gene-dense clusters are 
obvious, almost half the genes are dispersed in low G+C sequence separated 
by large tracts of apparently noncoding sequence. Only 1.1% of the genome 
is spanned by exons, whereas 24% is in introns, with 75% of the genome being 
intergenic DNA. Duplications of segmental blocks, ranging in size up to chro- 
mosomal lengths, are abundant throughout the genome and reveal a complex 
evolutionary history. Comparative genomic analysis indicates vertebrate ex- 
pansions of genes associated with neuronal function, with tissue-specific de- 
velopmental regulation, and with the hemostasis and immune systems. DNA 
sequence comparisons between the consensus sequence and publicly funded 
genome data provided locations of 2.1 million single-nucleotide polymorphisms 
(SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 
1250 on average, but there was marked heterogeneity jn the level of poly- 
morphism across the genome. Less than 1% of all SNPs resulted in variation in . 
proteins; but the task of determining which SNPs have functional consequences 
remains an open challenge. . T . . : 



Decoding of the DNA that constitutes the 
human genome has been widely anticipated 
for the contribution it will make toward un- 
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derstanding human evolution, the causation 
of disease, and the interplay between , the 
environment and heredity in defining the hu- 
man condition. A project with the goal of 
determining the complete nucleotide se- 
quence of the human genome was- first for- 
mally proposed in 1985 (7). In subsequent 
years, the idea met with mixed reactions in 
the scientific community (2). However, in 
1990, the Human Genome Project (HGP) was 
officially initiated in the United States under 
the direction of the National Institutes of 
Health and the U.S. Department of Energy 
with a 15-year, $3 billion plan for completing 
the genome sequence. In 1998 we announced 
our intention to build a unique genome- 
sequencing facility, to determine the se- 
quence of the human genome over a 3-year 
period. Here we report the penultimate mile- 
stone along the path toward that goal, a nearly 
complete sequence of the euchromatic por- 
tion of the human genome. The sequencing 
was performed by a whole-genome random 
shotgun method with subsequent assembly of 
the sequenced segments. 

The modern history of DNA sequencing 
began in 1977, when Sanger reported his meth- 
od for determining the order of nucleotides of 



A A using chain-terminating nucleotide ana- 
logs (3). In the same year, the first human gene 
was isolated and sequenced (4). In 1986, Hood 
and co-workers (5) described an improvement 
in the Sanger sequencing method that included 
attaching fluorescent dyes to the nucleotides, 
which permitted them to be sequentially read 1 
by a computer. The first automated DNA se- 
quencer, developed by Applied Biosystems in 
California in 1987, was shown to be successful 
when the sequences of two genes were obtained 

~ with this new technology (6). From early' se-. 
quencing of human genomic regions . (7), it 
became clear that cDNA sequences (which are 

- reverse-transcribed from RNA) would be es- 

- sential to annotate and validate gene predictions 
in the human genome. These studies were the 
basis in part for the development of . the ex- 
pressed sequence tag (EST) method of gene 
identification (5), which is a random selection, . 
very high throughput sequencing approach to 

- characterize cDNA libraries. -The EST method 
led to the rapid discovery and mapping of hu- 

* man genes (9). The. increasing numbers of hu- 
. man EST sequences necessitated the develop- 
ment of new computer algorithms to analyze 
large amounts of sequence data, and in 1993 at 
The Institute for Genomic Research (TIGR), an 
algorithm was developed that permitted assem- 
> bly and analysis of hundreds of thousands of 
ESTs. This algorithm, permitted characteriza- 
tion and annotation of human genes on the basis 
of 30,000 EST assemblies (10). 

The complete 49-kbp bacteriophage lamb- 
ida genome sequence was determined vby a 
. shotgun : restriction ' digest method in 1982 
(11). When considering methods for sequenc- 
ing the smallpox virus genome in 1991 (12), 
. a whole-genome shotgun sequencing method 
was discussed and subsequently rejected ow- 
ing to the lack of appropriate software tools 
for genome assembly. However, in 1994, 
when a microbial genome-sequencing project 
was contemplated at TIGR, a whole-genome 
shotgun sequencing approach was considered 
. possible with the TIGR EST assembly algo- 
rithm. In 1995, the 1.8-Mbp Haemophilus 
influenzae genome was completed by a 
whole-genome shotgun sequencing method 
(13). The experience with several subsequent 
genome-sequencing efforts, established the 
broad applicability of this approach (14, IS). 

A key feature of the sequencing approach 
used for these megabase-size and larger ge- 
nomes was the use of paired-end sequences 
(also called mate pairs), derived from sub- 
clone libraries with distinct insert sizes and 
cloning characteristics. Paired-end sequences 
are sequences 500 to 600 bp in length from 
both ends of double-stranded DNA clones of 
prescribed lengths. The success of using end 
sequences from long segments (18 to 20 kbp) 
of DNA cloned into bacteriophage lambda in 
assembly of the microbial genomes led to the 
suggestion (16) of an approach to simulta- 
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